Omnidirectional Transfer for Quasilinear Lifelong Learning

In biological learning, data are used to improve performance not only on the current task, but also on previously encountered and as yet unencountered tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called catastrophic forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of lifelong learning, whether biological or artificial, should be to improve performance on all tasks (including past and future) with any new data. We propose omnidirectional transfer learning algorithms, which include two special cases of interest: decision forests and deep networks. Our key insight is the development of the omni-voter layer, which ensembles representations learned independently on all tasks to jointly decide how to proceed on any given new data point, thereby improving performance on both past and future tasks. Our algorithms demonstrate omnidirectional transfer in a variety of simulated and real data scenarios, including tabular data, image data, spoken data, and adversarial tasks. Moreover, they do so with quasilinear space and time complexity.

1 Introduction
Learning is the process by which an intelligent system improves performance on a given task by leveraging data [1]. In biological learning, learning is lifelong, with agents continually building on past knowledge and experiences, improving on many tasks given data associated with any task. For example, learning a second language often improves performance in an individual's native language [2]. In classical machine learning, the system often starts with essentially zero knowledge, a "tabula rasa", and is optimized for a single task [3,4]. While it is relatively easy to simultaneously optimize for multiple tasks (multi-task learning) [5], it has proven much more difficult to sequentially optimize for multiple tasks [6,7]. Specifically, classical machine learning systems, and natural extensions thereof, exhibit "catastrophic forgetting" when trained sequentially, meaning their performance on the prior tasks drops precipitously upon training on new tasks [8,9]. This is in contrast to many biological learning settings, such as the second language learning setting mentioned above.
In the past 30 years, a number of sequential task learning algorithms have attempted to overcome catastrophic forgetting. These approaches naturally fall into one of two camps. In one, the algorithm has fixed resources, and so must reallocate resources (essentially compressing representations) in order to incorporate new knowledge [10][11][12][13][14]. Biologically, this corresponds to adulthood, where brains have a nearly fixed or decreasing number of cells and synapses. In the other, the algorithm adds (or builds) resources as new data arrive [15][16][17]. Biologically, this corresponds to development, where brains grow by adding cells, synapses, etc.
Approaches from both camps demonstrate some degree of continual (or lifelong) learning [18]. In particular, they can sometimes learn new tasks while not catastrophically forgetting old tasks. However, as we will show, many state of the art lifelong learning algorithms are unable to transfer knowledge forward, and none are able to transfer knowledge backward with small sample sizes where it is particularly important. This inability to omnidirectionally transfer has been identified as one of the key obstacles limiting the capabilities of artificial intelligence [19,20].
We present an approach to lifelong learning called "omnidirectional learning". Omnidirectional learning algorithms build on the ideas introduced in Progressive Neural Networks (ProgNN) [16], in which new tasks yield additional representational capacity. However, although ProgNNs are able to transfer forwards, they fail to transfer backwards. Moreover, as we will show, ProgNN requires quadratic space and time complexity in sample size. Our key innovation is the introduction of representation ensembling, which enables omnidirectional transfer via an "omni-voter" layer and reduces computational time and space from quadratic to quasilinear (i.e., linear up to polylog terms).
We implement two complementary omnidirectional learning algorithms, one based on decision forests (Omnidirectional Forests, Odif), and another based on deep networks (Omnidirectional Networks, Odin). Both Odif and Odin demonstrate forward and backward transfer, while maintaining computational efficiency. Simulations illustrate their learning capabilities, including performance properties in the presence of adversarial tasks. We then demonstrate their learning capabilities in vision and language benchmark applications. Although the omnidirectional algorithms presented here are primarily resource building, we illustrate that they can effectively leverage prior representations. This ability implies that the algorithm can convert from a "juvenile" resource building state to the "adult" resource recruiting state, all while maintaining key omnidirectional learning capabilities and efficiencies.
2 Background
2.1 Classical Machine Learning
Classical supervised learning [21] considers random variables $(X, Y) \sim P_{X,Y}$, where $X$ is an $\mathcal{X}$-valued input, $Y$ is a $\mathcal{Y}$-valued label (or response), and $P_{X,Y} \in \mathcal{P}_{X,Y}$ is the joint distribution of $(X, Y)$. Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$, the goal is to find the hypothesis (also called predictor or decision rule) $h : \mathcal{X} \to \mathcal{Y}$ that minimizes expected loss, or risk, $R(h) = \mathbb{E}_{X,Y}[\ell(h(X), Y)]$. A learning algorithm (or rule) is a function $f$ that maps data sets ($n$ training samples) to a hypothesis, where a data set $S_n = \{X_i, Y_i\}_{i=1}^{n}$ is a set of $n$ input/response pairs. Assume the $n$ samples of $(X, Y)$ pairs are independently and identically distributed from some true but unknown $P_{X,Y}$ [21]. A learning algorithm is evaluated on its generalization error (or expected risk), $\mathbb{E}[R(f(S_n))]$, where the expectation is taken with respect to the true but unknown distribution governing the data, $P_{X,Y}$. The goal is to choose a learner $f$ that learns a hypothesis $h$ that has a small generalization error for the given task [22].

Lifelong Learning
Lifelong learning generalizes classical machine learning in a few ways: (i) instead of one task, there is an environment T of (possibly infinitely) many tasks, (ii) data arrive sequentially, rather than in batch mode, and (iii) there are computational complexity constraints on the learning algorithm and hypotheses. This third requirement is crucial, though often implicit. Consider, for example, the algorithm that stores all the data, and then retrains everything from scratch each time a new sample arrives. Without computational constraints, such an algorithm could be classified as a lifelong learner; we do not think such a label is appropriate for that algorithm.
The goal in lifelong learning therefore is, given new data and a new task, use all the existing data to achieve lower generalization error on this new task, while also using the new data to obtain a lower generalization error on the previous tasks. This is distinct from classical online learning scenarios, because the previously experienced tasks may recur, so we are concerned about maintaining and improving performance on those tasks as well. Previous work in lifelong learning falls loosely into two algorithmic camps: (i) continually updating a fixed parametric model as new tasks arrive, and (ii) adding resources as new tasks arrive. Some approaches additionally store or replay previously encountered data to reduce forgetting [23][24][25]. In 'task-aware' scenarios, the learner is aware of all task details for all tasks, meaning that the hypotheses are of the form h : X × T → Y. In 'task-unaware' (or task agnostic [26]) scenarios the learner may not know that the task has changed at all, which means that the hypotheses are of the form, h : X → Y. We only address task-aware scenarios here.

Reference algorithms
We compared our approaches to nine reference lifelong learning methods. These algorithms can be classified into two groups based on whether they build new resources, or leverage fixed resources, given new tasks. Among [27], all have fixed capacity resources. For the first variant of exact replay, referred to as "Total replay", we replay all the data from all previous tasks whenever a new task is encountered. In the lifelong learning literature this is typically called "offline training". Replaying everything might however not be needed [25], and for the second variant of exact replay the amount of replay for each new task is fixed to the number of training samples in the new task, and the samples to be replayed are randomly selected from all the data of the previous tasks. For the baseline 'None', the network was incrementally trained on all tasks in the standard way while always only using the data from the current task. The implementations for all of the algorithms are adapted from open source codes [17,28]; for implementation details, see Appendix D.
3 Evaluation Criteria
Others have previously introduced criteria to evaluate transfer, including forward and backward transfer [29,30]. These definitions typically compare the difference, rather than the ratio, between learning with and without transfer. Pearl [19] introduced the transfer benefit ratio, which builds directly off relative efficiency from classical statistics [22]. Our definitions are closely related to his. Transfer efficiency is the ratio of the generalization error of (i) an algorithm that has learned only from data associated with a given task, to (ii) the same learning algorithm that also has access to other data. Let $R^t$ be the risk associated with task $t$, and $S^t_n$ be the data from $S_n$ that is specifically associated with task $t$, so $R^t(f(S^t_n))$ is the risk on task $t$ of the hypothesis learned by $f$ only on task $t$ data, and $R^t(f(S_n))$ denotes the risk on task $t$ of the hypothesis learned on all the data.
Definition 1 (Transfer Efficiency). The transfer efficiency of algorithm $f$ for given task $t$ with sample size $n$ is $\mathrm{TE}^t_n(f) := \mathbb{E}[R^t(f(S^t_n))] / \mathbb{E}[R^t(f(S_n))]$. We say that algorithm $f$ has transfer learned for task $t$ with data $S_n$ if and only if $\mathrm{TE}^t_n(f) > 1$.
To evaluate a lifelong learning algorithm while respecting the streaming nature of the tasks, it is convenient to consider two extensions of transfer efficiency. Forward transfer efficiency is the expected ratio of the risk of the learning algorithm with (i) access only to task $t$ data, to (ii) access to the data up to and including the last observation from task $t$. This quantity measures the relative effect of previously seen out-of-task data on the performance on task $t$. Formally, let $N^t = \max\{i : T_i = t\}$ be the index of the last occurrence of task $t$ in the data sequence. Let $S^{<t}_n$ denote all data up to and including that data point.
Definition 2 (Forward Transfer Efficiency). The forward transfer efficiency of $f$ for task $t$ given $n$ samples is $\mathrm{FTE}^t_n(f) := \mathbb{E}[R^t(f(S^t_n))] / \mathbb{E}[R^t(f(S^{<t}_n))]$. We say an algorithm (positive) forward transfers for task $t$ if and only if $\mathrm{FTE}^t_n(f) > 1$. In other words, if $\mathrm{FTE}^t_n(f) > 1$, then the algorithm has used data associated with past tasks to improve performance on task $t$.
One can also determine the rate of backward transfer by comparing $R^t(f(S^{<t}_n))$ to the risk of the hypothesis learned having seen the entire training dataset. More formally, backward transfer efficiency is the expected ratio of the risk of the learned hypothesis with (i) access to the data up to and including the last observation from task $t$, to (ii) access to the entire dataset. Thus, this quantity measures the relative effect of future task data on the performance on task $t$.

Figure 1: Schemas of composable hypotheses (components: representer, voter, decider). Ensembling voters is a well-established practice, including random forests and gradient boosted trees. Ensembling representations was previously used in lifelong learning scenarios, but without connections from future tasks to past ones. We introduce such connections, thereby enabling backward transfer.
Definition 3 (Backward Transfer Efficiency). The backward transfer efficiency of $f$ for task $t$ given $n$ samples is $\mathrm{BTE}^t_n(f) := \mathbb{E}[R^t(f(S^{<t}_n))] / \mathbb{E}[R^t(f(S_n))]$. We say an algorithm (positive) backward transfers for task $t$ if and only if $\mathrm{BTE}^t_n(f) > 1$. In other words, if $\mathrm{BTE}^t_n(f) > 1$, then the algorithm has used data associated with future tasks to improve performance on previous tasks.
After observing $m$ tasks, the extent to which the TE for the $j$th task comes from forward transfer versus from backward transfer depends on the order of the tasks. If we have a sequence in which tasks do not repeat, transfer efficiency for the first task is all backward transfer, for the last task it is all forward transfer, and for the middle tasks it is a combination of the two. In general, TE factorizes into FTE and BTE: $\mathrm{TE}^t_n(f) = \mathrm{FTE}^t_n(f) \times \mathrm{BTE}^t_n(f)$. Throughout, we will report log TE, so that positive transfer (TE > 1) corresponds to log TE > 0.
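As a small numerical illustration of the three ratios and their factorization, the sketch below computes TE, FTE, and BTE from three estimated generalization errors on a single task. The function name and the error values are hypothetical and chosen only for illustration; they are not results from this paper.

```python
import math

def transfer_efficiencies(err_single, err_up_to_t, err_all):
    """Return (TE, FTE, BTE) for one task from three estimated errors.

    err_single  : error when trained only on task-t data
    err_up_to_t : error when trained on all data up to the last task-t sample
    err_all     : error when trained on the entire data sequence
    """
    fte = err_single / err_up_to_t
    bte = err_up_to_t / err_all
    te = err_single / err_all
    return te, fte, bte

# Hypothetical error values, for illustration only.
te, fte, bte = transfer_efficiencies(0.30, 0.25, 0.20)

# On the log scale the factorization becomes a sum.
log_te, log_fte, log_bte = math.log(te), math.log(fte), math.log(bte)
```

On the log scale, forward and backward contributions add, which is one reason reporting log TE is convenient.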

Omnidirectional Algorithms
Our approach to lifelong learning relies on hypotheses that can be decomposed into three constituent parts: $h(\cdot) = w \circ v \circ u(\cdot)$ (Figure 1A). The representer, $u : \mathcal{X} \to \tilde{\mathcal{X}}$, maps an $\mathcal{X}$-valued input into an internal representation space $\tilde{\mathcal{X}}$ [31,32]. The voter $v : \tilde{\mathcal{X}} \to \Delta_{\mathcal{Y}}$ maps the transformed data into a posterior distribution (or, more generally, a score) on the response space $\mathcal{Y}$. Finally, a decider $w : \Delta_{\mathcal{Y}} \to \mathcal{Y}$ produces a predicted label. See Appendix A for a concrete example using a decision tree.
One can generalize the above decomposition by allowing for multiple representers. Given B different representers, one can attach a single voter to each representer, yielding B different voters (Figure 1B). Doing so requires generalizing the definition of a decider, which would operate on multiple voters. The decider is then said to ensemble the voters. This is the learning paradigm behind boosting [35] and bagging [36]; indeed, decision forests are a canonical example of a decision function operating on a collection of B outputs [37]. A decision forest learns B different decision trees, each of which has a tree structure corresponding to a representer. Each tree is assigned a voter that outputs that single tree's estimate of the probability that an observation is in each class. The decider outputs the most likely class averaged over the trees.
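The representer, voter, and decider roles can be made concrete with a toy sketch. The code below is only an illustration under simplifying assumptions: inputs are scalars, each representer is a single threshold (a stand-in for a tree's partition), and each voter stores smoothed class frequencies per cell. All function names are hypothetical.

```python
from collections import defaultdict

def make_representer(threshold):
    """Toy representer: map a scalar input to one of two cells (like a stump's leaves)."""
    return lambda x: int(x > threshold)

def fit_voter(representer, xs, ys, n_classes):
    """Toy voter: per-cell class frequencies with add-one smoothing."""
    counts = defaultdict(lambda: [1.0] * n_classes)
    for x, y in zip(xs, ys):
        counts[representer(x)][y] += 1.0

    def vote(x):
        cell_counts = counts[representer(x)]
        total = sum(cell_counts)
        return [c / total for c in cell_counts]

    return vote

def decide(posteriors):
    """Decider: average the voters' posteriors and return the argmax class."""
    n_classes = len(posteriors[0])
    mean = [sum(p[k] for p in posteriors) / len(posteriors) for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: mean[k])

# Three representers, one voter each, and a single decider that ensembles the voters.
xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
representers = [make_representer(t) for t in (-0.5, 0.0, 0.5)]
voters = [fit_voter(u, xs, ys, n_classes=2) for u in representers]
prediction_right = decide([v(1.5) for v in voters])
prediction_left = decide([v(-1.5) for v in voters])
```

Here each voter sees only its own representer, which corresponds to ensembling voters rather than ensembling representations.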
A further generalization of the above decomposition allows for each voter to ensemble the representers (Figure 1C). Doing so requires the introduction of an omni-voter layer, which is formally distinct from the voter function described above that operates solely on a single representer. The omni-voter ensembles all the existing representations, regardless of the order in which they were learned. In this scenario, as with bagging and boosting, the ensemble of voters then feeds into the single decider. When each representer has learned complementary representations, this latter approach has certain appealing properties, particularly in multiple task scenarios, including lifelong learning. See Appendix B for a concrete example. We developed two different omnidirectional learning algorithms that ensemble representations.
An Omnidirectional Forest (Odif) is a decision forest-based instance of ensembling representations. For each task, the representer $u_t$ of an Odif is a decision forest [37,38]. The leaf nodes of each decision tree partition the input space $\mathcal{X}$ [39]. The representation of $x \in \mathcal{X}$ corresponding to a single tree $b$ can be a one-hot encoded $L_b$-dimensional vector with a 1 in the location corresponding to the leaf of tree $b$ into which $x$ falls. The representation of $x$ resulting from the collection of trees simply concatenates the $B$ one-hot vectors from the $B$ trees. Thus, the representer $u_t$ is the mapping from $\mathcal{X}$ to a $B$-sparse vector of length $\sum_{b=1}^{B} L_b$. The posteriors are learned by populating the cells of the partitions and taking class votes with out-of-bag samples, as in 'honest trees' [39][40][41]. The posteriors output the average normalized class votes across the collection of trees, adjusted for finite sample bias [42]. The decider $w_t$ averages the posterior estimates and outputs the argmax to produce a single prediction. Recall that honest decision forests are universally consistent classifiers and regressors [41], meaning that with sufficiently large sample sizes, under suitable though general assumptions, they will converge to minimum risk. The single task version of this approach simplifies to an approach called 'Uncertainty Forests' [42]. Table 1 in the appendix lists the hyperparameters used in the CIFAR experiments.
An Omnidirectional Network (Odin) is a deep network (DN) based instance of ensembling representations. For each task, the representer $u_t$ in an Odin is the "backbone" of a DN, including all but the final layer. Thus, each $u_t$ maps an element of $\mathcal{X}$ to an element of $\mathbb{R}^d$, where $d$ is the number of neurons in the penultimate layer of the DN. In practice, we use the architecture described in van de Ven et al. [25] as "5 convolutional layers followed by 2 fully-connected layers each containing 2,000 nodes with ReLU non-linearities and a softmax output layer."
We trained this network using cross-entropy loss and the Adam optimizer [43] to learn the representer. The omni-voters are learned via k-Nearest Neighbors (k-NN) [44]. Recall that k-NN is a universally consistent classifier when $k$ is chosen such that, as the number of samples $n$ goes to infinity, $k$ also goes to infinity while $k/n \to 0$ [44]. We use $k = 16 \log_2 n$, which satisfies these conditions.
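To see how this choice of k behaves, the following sketch evaluates $k = 16 \log_2 n$ for a few sample sizes. Rounding to the nearest integer and clipping to the range [1, n] are assumptions made here for illustration; the text does not specify how non-integer values are handled.

```python
import math

def knn_k(n):
    """k = 16 * log2(n), rounded and clipped to [1, n] (rounding and clipping are assumptions)."""
    return max(1, min(n, round(16 * math.log2(n))))

sizes = [1_000, 10_000, 100_000, 1_000_000]
ks = [knn_k(n) for n in sizes]
ratios = [k / n for k, n in zip(ks, sizes)]

for n, k, r in zip(sizes, ks, ratios):
    print(f"n={n:>9,d}  k={k:>4d}  k/n={r:.6f}")
```

As n grows, k keeps increasing while k/n shrinks toward zero, matching the two consistency conditions stated above.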
In both cases, Odif and Odin, as new data from a new task arrive, our algorithms first build a new independent representer (using forests or networks). Then, they build the voter for this new task, which integrates information across all existing representers, thereby enabling forward transfer. If new data arrive from an old task, they can leverage the new representers to update the voters from the old tasks, thereby enabling backward transfer. In either case, new test data are passed through all existing representers and corresponding voters to make a prediction. Note that while updating the previous task voters with the cross-task posteriors, we do not need to subsample the previous task data (see Appendix C for implementation details and pseudocode).
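The update procedure described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation (see Appendix C for that): the class name `ToyOmniLearner` is hypothetical, the representer is a single split at the sample mean standing in for a forest or network backbone, and each task is assumed to arrive exactly once.

```python
from collections import defaultdict

class ToyOmniLearner:
    """Toy sketch of representation ensembling with per-task voters."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.representers = []  # one per task, frozen once learned
        self.data = {}          # task_id -> (xs, ys), stored for voter updates
        self.voters = {}        # task_id -> list of per-representer cell counts

    def _learn_representer(self, xs):
        # Stand-in for a forest or network backbone: split at the sample mean.
        cut = sum(xs) / len(xs)
        return lambda x: int(x > cut)

    def _fit_voters(self, task_id):
        xs, ys = self.data[task_id]
        tables = []
        for u in self.representers:
            counts = defaultdict(lambda: [1.0] * self.n_classes)  # add-one smoothing
            for x, y in zip(xs, ys):
                counts[u(x)][y] += 1.0
            tables.append(counts)
        self.voters[task_id] = tables

    def add_task(self, task_id, xs, ys):
        self.data[task_id] = (list(xs), list(ys))
        self.representers.append(self._learn_representer(xs))
        # Refit the voters of every task over ALL representers: the new task
        # uses old representers (forward), old tasks use the new one (backward).
        for t in self.data:
            self._fit_voters(t)

    def predict(self, x, task_id):
        mean = [0.0] * self.n_classes
        for u, counts in zip(self.representers, self.voters[task_id]):
            cell = counts[u(x)]
            for k in range(self.n_classes):
                mean[k] += cell[k] / sum(cell)
        return max(range(self.n_classes), key=lambda k: mean[k])

learner = ToyOmniLearner(n_classes=2)
learner.add_task("a", [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
learner.add_task("b", [-2.0, -1.0, 1.0, 2.0], [1, 1, 0, 0])
```

After the second call, the voter for task "a" draws on two representers even though no new task "a" data arrived, which is the mechanism behind backward transfer in this sketch.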
Odin was motivated by ProgNN, but differs from ProgNN in two key ways. First, recall that ProgNN builds a new neural network 'column' for each new task, and also builds lateral connections between the new column, and all previous columns. In contrast, Odin excludes those lateral connections, thereby greatly reducing the number of parameters and train time. Moreover, this makes each representation independent, thereby potentially avoiding interference across representations. Second, for inference on task j data, assuming we have observed tasks up to J > j, ProgNN only leverages representations learned from tasks up to j, thereby excluding tasks j + 1, . . . , J. In contrast, Odin leverages representations from all J tasks. This difference enables backward transfer.
Odif adds yet another difference as compared to Odin by replacing the deep network representers with random forest representers. This has the effect of making the capacity, space complexity, and time complexity scale with the complexity and sample size of each task. In contrast, both ProgNN and Odin have a fixed capacity for each task, even if the tasks have very different sample sizes and complexities.
Table 1: Capacity, space, and time constraints of the representation learned by various lifelong learning algorithms. We show soft-O notation ($\tilde{O}(\cdot, \cdot)$, defined in main text) as a function of $n = \sum_{t=1}^{T} n_t$ and $T$, as well as the common setting where $n$ is proportional to $T$. Our omnidirectional algorithms are the only algorithms whose representation space grows, but subquadratically with $n$ or $T$, and Odif is the only algorithm whose time complexity is linear in $n$ for learning the representation.
(Table 1 columns: Parametric, Capacity, Space, Time, Examples.)

A computational taxonomy of lifelong learning
Lifelong learning approaches can be divided into those with fixed resources, and those with growing resources. We therefore quantify the computational space and time complexity of the internal representation of a number of algorithms, using both theoretical analysis and empirical investigations. We also study the representation capacity of these algorithms. We use the soft-O notation $\tilde{O}$ to quantify complexity [45]. Letting $n$ be the sample size and $T$ be the number of tasks, we write that the complexity of a lifelong learning algorithm is $f(n, T) = \tilde{O}(g(n, T))$ when $|f|$ is bounded above asymptotically by a function $g$ of $n$ and $T$ up to a constant factor and polylogarithmic terms. Table 1 summarizes the capacity, space and time complexity of several reference algorithms, as well as our Odin and Odif. For the deep learning methods, we assume that the number of iterations is proportional to the number of samples. For space and time complexity, the table shows results as a function of $n$ and $T$, as well as the common scenario where sample size per task is fixed and therefore proportional to the number of tasks, $n \propto T$.
Fixed resource lifelong learning methods are parametric, in that their representational capacity is invariant to sample size and task number, and they have computational space complexity of $\tilde{O}(1)$ [22]. Given a sufficiently large number of tasks, without placing constraints on the relationship between the tasks, eventually all parametric methods will catastrophically forget at least some things. EWC, Online EWC, SI, and LwF are all examples of parametric lifelong learning algorithms.
Semi-parametric algorithms are algorithms whose representational capacity grows slower than sample size. For example, if $T$ is increasing slower than $n$ (e.g., $T \propto \log n$), then algorithms whose capacity is proportional to $T$ are semi-parametric. ProgNN is semi-parametric with space complexity $\tilde{O}(T^2)$ due to the lateral connections. Moreover, the time complexity for ProgNN also scales quadratically with $n$ when $n \propto T$. Thus, an algorithm that literally stores all the data it has ever seen, and retrains a fixed size network on all that data with the arrival of each new task, would have smaller space complexity and the same time complexity as ProgNN. For comparison, we implement such an algorithm and refer to it as Total Replay. DF-CNN improves upon ProgNN by introducing a knowledge base with lateral connections to each new column, thereby avoiding all pairwise connections. Because these semi-parametric methods have a fixed representational capacity per task, they will either lack the representation capacity to perform well given sufficiently complex tasks, and/or will waste resources for very simple tasks.
Odin eliminates the lateral connections between columns of the network, thereby reducing space complexity down to $\tilde{O}(T)$. Odin stores all the data to enable backward transfer, but retains linear time complexity. To our knowledge, Odif is the only non-parametric lifelong learning algorithm. Its capacity, space and time complexity are all $\tilde{O}(n)$, meaning that its representational capacity naturally increases with the complexity of each task.
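The gap between quadratic and linear growth in the number of tasks can be illustrated by simple counting. The sketch below counts per-task columns and pairwise lateral connection blocks, assuming (as described above for ProgNN) that each new column connects to every earlier column. It counts blocks rather than parameters, and the function name is hypothetical.

```python
def column_and_lateral_counts(num_tasks):
    """Count per-task columns and pairwise lateral connection blocks after num_tasks tasks."""
    columns = num_tasks
    # Each new column connects to all earlier ones: 0 + 1 + ... + (num_tasks - 1).
    laterals = num_tasks * (num_tasks - 1) // 2
    return columns, laterals

for T in (10, 100, 1000):
    columns, laterals = column_and_lateral_counts(T)
    print(f"T={T:>4d}  columns={columns:>4d}  lateral blocks={laterals:>6d}")
```

Dropping the lateral blocks leaves only the term that is linear in T, which is the reduction attributed to Odin above.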

Illustrating Omnidirectional Learning with Odif
Omnidirectional learning in a simple environment
Consider a very simple two-task environment: Gaussian XOR and Gaussian Exclusive NOR (XNOR) (Figure 2A, see Appendix E for details). The two tasks share the exact same discriminant boundaries: the coordinate axes. Thus, transferring from one task to the other merely requires learning a bit flip. We sample 750 samples from XOR, followed by another 750 samples from XNOR.
Odif and random forests (RF) achieve the same generalization error on XOR when training with XOR data (Figure 2Bi). But because RF does not account for a change in task, when XNOR data appear, RF performance on XOR gets worse and worse. In contrast, Odif continues to improve on XOR given XNOR data, demonstrating backwards transfer. Now consider the generalization error on XNOR ( Figure 2Bii). Both Odif and RF are at chance levels when only XOR data are available. When XNOR data are available, RF must unlearn everything it learned from the XOR data, and thus its performance on XNOR starts out nearly maximally inaccurate, and quickly improves. On the other hand, because Odif can leverage the representer learned using the XOR data, upon getting any XNOR data, it immediately performs quite well, and then continues to improve with further XNOR data, demonstrating forward transfer (Figure 2Biii). Odif demonstrates positive forward and backward transfer for all sample sizes, whereas RF fails to demonstrate forward or backward transfer, and eventually catastrophically forgets the previous tasks.
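For readers who want to reproduce the flavor of this environment, the sketch below draws Gaussian XOR and XNOR samples. The blob centres at (±1, ±1), the standard deviation, and the label convention are illustrative assumptions; the actual simulation parameters are given in Appendix E and may differ.

```python
import random

def sample_gaussian_xor(n, flip=False, std=0.5, seed=0):
    """Draw n points from four Gaussian blobs centred at (+-1, +-1).

    XOR label: 1 when the blob centre's coordinates have opposite signs, else 0.
    flip=True gives XNOR: the same boundaries with labels inverted.
    Centres, std, and label convention are assumptions for illustration.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        cx, cy = rng.choice([-1, 1]), rng.choice([-1, 1])
        points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
        label = int(cx != cy)
        labels.append(1 - label if flip else label)
    return points, labels

xor_points, xor_labels = sample_gaussian_xor(750, seed=1)
xnor_points, xnor_labels = sample_gaussian_xor(750, flip=True, seed=1)
```

With a shared seed the two draws contain identical points and opposite labels, which makes the "bit flip" relationship between the tasks explicit.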

Omnidirectional learning in adversarial environments
Statistics has a rich history of robust learning [46], and machine learning has recently focused on adversarial learning [47]. However, in both cases the focus is on adversarial examples, rather than adversarial tasks. In the context of omnidirectional learning, we informally define a task $t$ to be adversarial with respect to task $t'$ if the true joint distribution of task $t$, without any domain adaptation, impedes performance on task $t'$. In other words, training data from task $t$ can only add noise, rather than signal, for task $t'$. An adversarial task for Gaussian XOR is Gaussian XOR rotated by 45° (R-XOR) (Figure 2Aiii). Training on R-XOR therefore impedes the performance of Odif on XOR, and thus backward transfer falls below one, demonstrating graceful forgetting (Figure 2Ci). Because R-XOR is more difficult than XOR for Odif (because the discriminant boundaries are oblique [48]), and because the discriminant boundaries are learned imperfectly with finite data, data from XOR can actually improve performance on R-XOR, and thus forward transfer is positive. In contrast, both forward and backward transfer are negative for RF.
To further investigate this relationship, we designed a suite of R-XOR examples, varying the rotation angle θ between 0° and 360°, sampling 100 points from XOR, and another 100 from each R-XOR (Figure 2Cii). As the angle increases from 0° to 45°, log BTE flips from positive (≈ 0.18) to negative (≈ −0.11). The 45°-XOR is the maximally adversarial R-XOR. Thus, as the angle further increases, log BTE increases back up to ≈ 0.18 at 90°, which has an identical discriminant boundary to XOR. Moreover, when θ is fixed at 25°, BTE increases at different rates for different sample sizes of the source and the target task (Figure 2Ciii).
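The geometric reason that 45° is the worst case and 90° recovers the original boundary can be checked with a plain rotation. The helper below is a hypothetical illustration: it rotates 2-D points about the origin, so a point on a coordinate axis (the XOR boundary) lands on the diagonal at 45° and back on a coordinate axis at 90°.

```python
import math

def rotate(points, degrees):
    """Rotate 2-D points counter-clockwise about the origin by the given angle."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

boundary_point = [(1.0, 0.0)]          # lies on the x-axis, part of the XOR boundary
at_45 = rotate(boundary_point, 45)[0]  # lands on the diagonal, far from both axes
at_90 = rotate(boundary_point, 90)[0]  # lands on the y-axis, again a coordinate axis
```

Applying `rotate` to XOR samples with a chosen angle gives an R-XOR task in this toy setting.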
Together, these experiments indicate that the amount of transfer can be a complicated function of (i) the difficulty of learning good representations for each task, (ii) the relationship between the two tasks, and (iii) the sample size of each. Appendix E further investigates this phenomenon in a multi-spiral environment.

Real data experiments
We consider two modalities for real data experiments: vision and language. Below we provide a detailed analysis of the performance of lifelong learning algorithms in vision data; Appendix F provides details for our language experiments, which have qualitatively similar results illustrating that Odif is a modality agnostic, sample and computationally efficient, lifelong learning algorithm.
The CIFAR 100 challenge [49], consists of 50,000 training and 10,000 test samples, each a 32x32 RGB image of a common object, from one of 100 possible classes, such as apples and bicycles. CIFAR 10x10 divides these data into 10 tasks, each with 10 classes [17] (see Appendix F for details). We In contrast, while neither ProgNN nor DF-CNN exhibit catastrophic forgetting, they also do not exhibit any positive backward transfer. Final transfer efficiency per task is the transfer efficiency associated with that task having seen all the data. Odif and Odin both demonstrate positive final transfer efficiency for all tasks, whereas ProgNN and DF-CNN both exhibit negative final transfer efficiency for at least one task.

Resource Constrained Experiments
It is possible that the above algorithms are leveraging additional resources to improve performance without meaningfully transferring information between representations. To address this concern, we devised a 'resource constrained' variant of Odif. In this constrained variant, we compare the lifelong learning algorithm to its single task variant, but ensure that they both have the same amount of resources. For example, on Task 2, we would compare Odif with 20 trees (10 trained on 500 samples from Task 1, and another 10 trained on 500 samples from Task 2) to RF with 20 trees (all trained on 500 samples from Task 2). If Odif is able to meaningfully transfer information across tasks, then its resource-constrained FTE and BTE will still be positive. Indeed, FTE remains positive after enough tasks, and BTE is actually invariant to this change (Figure 3, bottom left and center). In contrast, all of the reference algorithms that have fixed resources exhibit negative forward and backward transfer. Moreover, the reference algorithms also all exhibit negative final transfer efficiency on each task, whereas our resource constrained Odif maintains positive final transfer on every task (Figure 3, top right). Interestingly, when using 5,000 samples per task, replay methods are able to demonstrate positive forward and backward transfer (Supplementary Figure 4), although they require quadratic time. Note that in this experiment, building the single task learners actually required substantially more resources, specifically, 10 + 20 + · · · + 100 = 550 trees, as compared with only 100 trees in the prior experiments. In general, ensuring that single task learners use the same amount of resources per task as omnidirectional learners requires $\tilde{O}(n^2)$ resources, whereas Odif only requires $\tilde{O}(n)$, a polynomial reduction in resources.
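The tree counts quoted above follow from a short calculation, sketched here with the numbers from the text (10 trees per task, 10 tasks). The variable names are illustrative.

```python
trees_per_task = 10
num_tasks = 10

# Matched single-task baselines: the learner for task t needs as many trees
# as the lifelong learner holds after t tasks, i.e. 10, 20, ..., 100.
single_task_total = sum(trees_per_task * t for t in range(1, num_tasks + 1))

# The lifelong learner adds only trees_per_task new trees per task.
lifelong_total = trees_per_task * num_tasks

print(single_task_total, lifelong_total)
```

The first total grows quadratically in the number of tasks and the second linearly, which mirrors the resource comparison in the paragraph above.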

Resource Recruiting Experiments
The binary distinction we made above, in which algorithms either build resources or reallocate them, is a false dichotomy, and biologically unnatural. In biological learning, systems develop from building (juvenile) to recruiting (adult) resources. We therefore trained Odif on the first nine CIFAR 10x10 tasks using 50 trees per task, with 500 samples per task. For the tenth task, we could (i) select the 50 trees (out of the 450 existing trees) that perform best on task 10 (recruiting), (ii) train 50 new trees, as Odif would normally do (building), (iii) build 25 and recruit 25 trees (hybrid), or (iv) ignore all prior trees (RF). Odif outperforms other approaches except when 5,000 training samples are available, but the recruiting approach is nearly as good as Odif (Figure 3, bottom right). This result motivates future work on strategies for optimally leveraging existing resources given a new task, as well as on task-unaware settings.
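The recruiting option (i) amounts to a top-k selection over existing trees. The sketch below assumes each existing tree has already been scored on data from the new task; how "perform best" is measured is not specified in this section, so the per-tree scores here are random placeholders and the function name is hypothetical.

```python
import random

def recruit_top_k(scores, k):
    """Return the indices of the k highest-scoring trees, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

rng = random.Random(0)
# Placeholder per-tree accuracies on task 10 for the 450 existing trees.
scores = [rng.random() for _ in range(450)]
recruited = recruit_top_k(scores, 50)
```

A hybrid strategy could call the same helper with k = 25 and train the remaining 25 trees from scratch.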
Adversarial Experiments
Consider the same CIFAR 10x10 experiment above, but, for tasks two through nine, randomly permute the class labels within each task, rendering each of those tasks adversarial with regard to the first task (because the labels are uninformative). Figure 4A indicates that backward transfer efficiency for both Odif and Odin is invariant to such label shuffling (the other algorithms also seem invariant to label shuffling, but did not demonstrate positive backward transfer). Now, consider a Rotated CIFAR experiment, which uses only data from the first task, divided into two equally sized subsets (making two tasks), where the second subset is rotated by different amounts (Figure 4, right). Transfer efficiency of both Odif and Odin is nearly invariant to rotation angle, whereas the other approaches are far more sensitive to rotation angle. Note that zero rotation angle corresponds to the two tasks having identical distributions.

Discussion
We introduced quasilinear representation ensembling as an approach to omnidirectional lifelong learning. The two specific algorithms we developed, Odif and Odin, demonstrate the possibility of achieving both forward and backward transfer, due to leveraging resources (representers) learned for other tasks without undue computational burdens. Forest-based representation ensembling approaches can easily add new resources when appropriate. This work therefore motivates additional work on deep learning to enable dynamically adding resources when appropriate [51].
To achieve backward transfer, Odif and Odin stored old data to vote on the newly learned transformers. Because the representation space scales quasilinearly with sample size, storing the data does not increase the computational complexity of the algorithm, and it remains quasilinear. It could be argued that by keeping old data and training a model with increasing capacity from scratch (a sequential multitask learning approach), it would be straightforward to maintain performance (TE = 1) in a partic- Odif and Odin. Thus, one natural extension of this work would obviate the need to store all the data by using a generative model. While we employed representation ensembling to address catastrophic forgetting, the paradigm of ensembling representations rather than learners can be readily applied more generally. For example, 'batch effects' (sources of variability unrelated to the scientific question of interest) have plagued many fields of inquiry, including neuroscience [52] and genomics [53]. Similarly, federated learning is becoming increasingly central in artificial intelligence, due to its importance in differential privacy [54]. This may be particularly important in light of global pandemics such as COVID-19, where combining small datasets across hospital systems could enable more rapid discoveries [55].
Finally, our representation ensembling approach closely resembles the constructivist view of brain development [56,57]. According to this view, the brain goes through progressive elaboration of neural circuits resulting in an augmented cognitive representation while maturing in a certain skill. In a similar way, omnidirectional algorithms can mature in a particular skill such as vision tasks by learning a rich representer dictionary from different vision datasets and thereby transfer forward to future or yet unseen vision datasets (see CIFAR 10x10 recruitment experiment as a proof). However, there is also substantial pruning during development and maturity in the brain circuitry, which is important for performance [58]. This motivates future work on pruning adversarial representers to enhance the transferability among tasks even more. Moreover, by carefully designing experiments in which both behaviors and brain are observed while learning across sequences of tasks (possibly in multiple stages of neural development or degeneration), we may be able to learn more about how biological agents are able to omnidirectionally learn so efficiently, and transfer that understanding to building more effective artificial intelligences. In the meantime, our code, including code to reproduce the experiments in this manuscript, is available from http://proglearn.neurodata.io/.

Appendix A. Decision Tree as a Compositional Hypothesis. Consider learning a decision tree for a two class classification problem. The input to the decision tree is a set of $n$ feature-vector/response pairs, $(x_i, y_i)$. The learned tree structure corresponds to the representer $u$, because the tree structure maps each input feature vector into an indicator encoding in which leaf node each feature vector resides.
Formally, $u : \mathcal{X} \to [L]$, where $[L] = \{1, 2, \ldots, L\}$ and $L$ is the total number of leaf nodes. In other words, $u$ maps from the original data space to an $L$-dimensional one-hot encoded sparse binary vector, where the sole non-zero entry indicates in which leaf node a particular observation falls, that is, $\tilde{x} := u(x)$. Learning the voter is simply a matter of counting the fraction of observations in each leaf per class.
So, the voter is trained using $n$ transformed feature-vector/response pairs $(\tilde{x}_i, y_i)$, and it assigns a probability of each class in each leaf: $\{v_l := P[y_i = 1 \mid \tilde{x}_i = l], \forall l \in [L]\}$ and $v(\tilde{x}) = v_{\tilde{x}}$. In other words, for two class classification, $v$ maps from the $L$-dimensional binary vector to the probability that $x$ is in class 1. The decider is simply $w(v(\tilde{x})) = \mathbb{1}\{v(\tilde{x}) > 0.5\}$, that is, it outputs the most likely class label of the leaf node that $x$ falls into. For inference, the tree is given a single $x$, and it is passed down the tree until it reaches a leaf node, where it is represented by its leaf identifier $\tilde{x}$. The voter takes $\tilde{x}$ as input, and outputs the estimated posterior probability of being in class 1 for the leaf node in which $\tilde{x}$ resides, $v(\tilde{x})$. If $v(\tilde{x})$ is bigger than 0.5, the decider decides that $x$ is in class 1, and otherwise, it decides it is in class 0.
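The decomposition of a tree into representer, voter, and decider can be made concrete with a small sketch. It assumes one-dimensional inputs and a partition given by fixed cut points rather than a learned tree, and it indexes leaves from 0 rather than 1; all function names are hypothetical.

```python
from bisect import bisect_right
from typing import Callable, List, Sequence


def make_representer(thresholds: Sequence[float]) -> Callable[[float], int]:
    """u: map a scalar x to a leaf index in {0, ..., L-1}, with L = len(thresholds) + 1."""
    cuts = sorted(thresholds)
    return lambda x: bisect_right(cuts, x)


def fit_voter(u, xs: Sequence[float], ys: Sequence[int], num_leaves: int) -> List[float]:
    """v: per-leaf fraction of class-1 training points (0.5 for empty leaves)."""
    counts = [0] * num_leaves
    ones = [0] * num_leaves
    for x, y in zip(xs, ys):
        leaf = u(x)
        counts[leaf] += 1
        ones[leaf] += y
    return [o / c if c else 0.5 for o, c in zip(ones, counts)]


def decide(posterior: float) -> int:
    """w: output class 1 when the estimated posterior exceeds 0.5."""
    return int(posterior > 0.5)


def predict(u, v: List[float], x: float) -> int:
    """Compose w(v(u(x))) for a single input."""
    return decide(v[u(x)])


if __name__ == "__main__":
    u = make_representer([0.0])  # two leaves: x < 0 and x >= 0
    v = fit_voter(u, [-2, -1, 1, 2, 3], [0, 0, 1, 1, 0], num_leaves=2)
    print(v, predict(u, v, -1.5), predict(u, v, 1.5))
```

Keeping the three pieces as separate functions is what allows a voter to be refit on a different task's data while the representer stays fixed.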

Appendix B. Compositional Representation Ensembling.
Consider a scenario in which we have two tasks, one following the other. Assume that we already learned a single decomposable hypothesis for the first task, $w_1 \circ v_1 \circ u_1$, and then we get new data associated with a second task. Let $n_1$ denote the sample size for the first task, $n_2$ the sample size for the second task, and $n = n_1 + n_2$. The representation ensembling approach generally works as follows. First, since we want to transfer forward to the second task, we push all the new data through the first representer $u_1$, which yields $\tilde{x}^{(1)}_{n_1+1}, \ldots, \tilde{x}^{(1)}_{n}$. Second, we learn a new representer $u_2$ using the new data, $\{(x_i, y_i)\}_{i=n_1+1}^{n}$. We then push the new data through the new representer, yielding $\tilde{x}^{(2)}_{n_1+1}, \ldots, \tilde{x}^{(2)}_{n}$. Third, we train a new voter, $v_2$, on the outputs of both representers, that is, on $\{(\tilde{x}^{(j)}_i, y_i)\}_{i=n_1+1}^{n}$ for $j = 1, 2$. The output of $v_2$ for any new input $x$ is the posterior probability (or score) for that point for each potential response in task two (class label). Thus, by virtue of ensembling these representations, this approach enables forward transfer [16,59]. Now, we would also like to improve performance on the first task using the second task's data. While many lifelong methods have tried to achieve this kind of backward transfer, to date, they have mostly failed [15]. Recall that previously we had already pushed all the first task data through the first task representer, which had yielded $\tilde{x}^{(1)}_1, \ldots, \tilde{x}^{(1)}_{n_1}$. Assuming we kept any of the first task's data, or can adequately simulate it, we can push those data through $u_2$ to get a second representation of the first task's data: $\tilde{x}^{(2)}_1, \ldots, \tilde{x}^{(2)}_{n_1}$. Then, $v_1$ would be trained on both representations of the first task's data.
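The two-task procedure can be sketched end to end on toy data. The sketch makes several assumptions that are not part of the method as described: inputs are one-dimensional, a "representer" is an equal-count partition of the line rather than a learned tree or network, and each task's voters are combined by averaging their per-leaf posteriors. All names are hypothetical.

```python
from bisect import bisect_right


def fit_representer(xs, num_leaves):
    """Toy representer: cut points at equal-count quantiles of xs."""
    s = sorted(xs)
    cuts = [s[(i * len(s)) // num_leaves] for i in range(1, num_leaves)]
    return lambda x: bisect_right(cuts, x)


def fit_voter(u, xs, ys, num_leaves):
    """Per-leaf estimate of P[y = 1 | leaf]; empty leaves fall back to 0.5."""
    counts = [0] * num_leaves
    ones = [0] * num_leaves
    for x, y in zip(xs, ys):
        counts[u(x)] += 1
        ones[u(x)] += y
    return [o / c if c else 0.5 for o, c in zip(ones, counts)]


def posterior(representers, voters, x):
    """Average one task's per-representer posteriors at the point x."""
    return sum(v[u(x)] for u, v in zip(representers, voters)) / len(voters)


L = 4
x1, y1 = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9], [0, 0, 0, 0, 1, 1, 1, 1]
x2, y2 = [0.15, 0.35, 0.45, 0.55, 0.65, 0.85], [1, 1, 1, 0, 0, 0]

# Task 1 arrives: learn u_1 and the in-task voter.
u1 = fit_representer(x1, L)
voters_task1 = [fit_voter(u1, x1, y1, L)]

# Task 2 arrives: learn u_2; forward transfer reuses u_1 on task 2 data.
u2 = fit_representer(x2, L)
voters_task2 = [fit_voter(u1, x2, y2, L), fit_voter(u2, x2, y2, L)]

# Backward transfer: push stored task 1 data through the new representer u_2.
voters_task1.append(fit_voter(u2, x1, y1, L))

representers = [u1, u2]
print(posterior(representers, voters_task1, 0.75))
print(posterior(representers, voters_task2, 0.2))
```

Note that the representers are never refit: each new task adds one representer and updates only voters, which is why old data (or a simulation of it) must be available for the backward step.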
This 'replay-like' procedure facilitates backward transfer, that is, improving performance on previous tasks by leveraging data from newer tasks. Both the forward and backward transfer updates can be implemented every time we obtain data associated with a new task. Enabling the omni-voters to ensemble omnidirectionally between all sets of tasks is the key innovation of our proposed omnidirectional learning approaches.

Appendix C. Omnidirectional Algorithms. We propose two concrete omnidirectional algorithms, Omnidirectional Forests (Odif) and Omnidirectional Networks (Odin). The two algorithms differ in the details of how they update representers and voters, but, abstracting a level up, they are both special cases of the same procedure. Let Odix refer to any possible omnidirectional algorithm. Algorithms 1, 2, 3, and 4 provide pseudocode for adding representers, updating voters, and making predictions for any Odix algorithm; the sections below provide Odif and Odin specific details.
Algorithm 1 Add a new Odix representer for a task. OOB = out-of-bag.

Input:
(1) $t$ ⊲ current task number
(2) $D^t_n = (x_t, y_t) \in \mathbb{R}^{n \times p} \times \{1, \ldots, K\}^n$ ⊲ training data for task $t$
Output:
(1) $u_t$ ⊲ a representer set
(2) $I^t_{OOB}$ ⊲ a set of the indices of OOB data
function Odix.FIT($t$, $(x_t, y_t)$)
    ⊲ train a representer X on bootstrapped data
    return $u_t$, $I^t_{OOB}$
end function

Algorithm 2 Add a new Odix voter for the current task.

Input:
(1) $t$ ⊲ current task number
(2) [...]
(3) $D^t_n = (x_t, y_t) \in \mathbb{R}^{n \times p} \times \{1, \ldots, K\}^n$ ⊲ training data for task $t$
(4) $I^t_{OOB}$ ⊲ a set of the indices of OOB data for the current task
Output:
$v_t$ ⊲ in-task ($t' = t$) and cross-task ($t' \neq t$) voters for task $t$
function [...]
    [...] ⊲ add the in-task voter using OOB data
    for $t' = 1, \ldots, t - 1$ do ⊲ update the cross task voters for task $t$
        $v_{tt'} \leftarrow u_{t'}$.add_voter($x_t$, $y_t$)
    end for
    return $v_t$
end function

Appendix D. Reference Algorithm Implementation Details. The same network architecture was used for all compared deep learning methods. Following van de Ven et al. [25], the 'base network architecture' consisted of five convolutional layers followed by two fully-connected layers, each containing 2000 nodes with ReLU non-linearities, and a softmax output layer. The convolutional layers had 16, 32, 64, 128 and 254 channels; they used batch-norm and a ReLU non-linearity, and they had a 3x3 kernel, a padding of 1 and a stride of 2 (except the first layer, which had a stride of 1). This architecture was used with a multi-headed output layer (i.e., a different output layer for each task) for all algorithms using a fixed-size network. For ProgNN and DF-CNN the same architecture was used for each column introduced for each new task, and in our Odin this architecture was used for the transformers $u_t$ (see above). In these implementations, ProgNN and DF-CNN have the same architecture for each column introduced for each task. Each column has an input layer followed by 4 convolutional layers with sizes 3 × 3 × 32, 3 × 3 × 32, 3 × 3 × 64 and 3 × 3 × 64, respectively. These are followed by a fully-connected layer with 64 nodes and an output layer with 10 nodes. ReLU activation was used after each layer. The other algorithms use a common architecture with input layers defined by the size of the input data, two hidden layers with 400 nodes each and a multi-headed output layer (different output layers for different tasks). The algorithms differ only in the way they penalize the update of network parameters for the current task based on the previous tasks. Each of these algorithms has 1.4M parameters in total.
Algorithm 3 Update Odix voter for the previous tasks.
Input: [...]

Algorithm 4 [...]
    [...]
    for $t' = 1, \ldots, T$ do ⊲ update the posteriors calculated from $T$ task voters
        $\hat{p}_t \leftarrow \hat{p}_t + v_{tt'}$.predict_proba($u_{t'}(x)$)
    end for
    $\hat{p}_t \leftarrow \hat{p}_t / T$
    $\hat{y} = \operatorname{argmax}_i(\hat{p}_t)$ ⊲ find the index $i$ of the element of the vector $\hat{p}_t$ with maximum probability
    return $\hat{y}$
end function

Appendix E. Simulated Results. In each simulation, we constructed an environment with two tasks. For each, we sample 750 times from the first task, followed by 750 times from the second task. These 1,500 samples comprise the training data. We sample another 1,000 hold out samples to evaluate the algorithms. We fit a random forest (RF) (technically, an uncertainty forest, which is an honest forest with a finite-sample correction [42]) and an Odif. We repeat this process 30 times to obtain error bars. Error bars in all cases were negligible.

E.1 Gaussian XOR
Gaussian XOR is a two class classification problem with equal class priors. Conditioned on being in class 0, a sample is drawn from a mixture of two Gaussians with means $\pm[0.5, 0.5]^\top$ and variances proportional to the identity matrix. Conditioned on being in class 1, a sample is drawn from a mixture of two Gaussians with means $\pm[0.5, -0.5]^\top$ and variances proportional to the identity matrix. Gaussian XNOR is the same distribution as Gaussian XOR with the class labels flipped. Rotated XOR (R-XOR) rotates XOR by $\theta$ degrees.
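A sampler for these three distributions can be sketched as follows. The per-coordinate standard deviation (`std=0.25`) is an assumed value, since the text only states that the variances are proportional to the identity; equal mixture weights and the function names are likewise assumptions.

```python
import math
import random


def sample_xor(n, std=0.25, seed=0, flip=False):
    """Draw n points from Gaussian XOR; flip=True gives XNOR (labels flipped)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)           # equal class priors
        sign = rng.choice((-1.0, 1.0))  # choose one of the two mixture components
        mean = (0.5, 0.5) if y == 0 else (0.5, -0.5)
        xs.append(tuple(sign * m + rng.gauss(0.0, std) for m in mean))
        ys.append(1 - y if flip else y)
    return xs, ys


def rotate(xs, degrees):
    """Rotate 2-D points counterclockwise about the origin (R-XOR)."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * a - s * b, s * a + c * b) for a, b in xs]


if __name__ == "__main__":
    x_xor, y_xor = sample_xor(750, seed=1)
    x_xnor, y_xnor = sample_xor(750, seed=2, flip=True)
    x_rxor = rotate(sample_xor(750, seed=3)[0], 45)
    print(len(x_xor), len(x_xnor), len(x_rxor))
```

Because the flip does not consume any random draws, calling `sample_xor` twice with the same seed and opposite `flip` values returns identical points with complementary labels.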

E.2 Spirals
A description of the distributions for the two tasks is as follows: let $K$ be the number of classes and $S \sim \text{multinomial}(\frac{1}{K}\vec{1}_K, n)$. Conditioned on $S$, each feature vector is parameterized by two variables, the radius $r$ and an angle $\theta$. For each sample, $r$ is sampled uniformly in $[0, 1]$. Conditioned on a particular class, the angles are evenly spaced between [...], where $t_K$ controls the number of turns in the spiral. To inject noise along the spiral, we add Gaussian noise to the evenly spaced angles $\theta'$: $\theta = \theta' + \mathcal{N}(0, \sigma_K^2)$. The observed feature vector is then $(r \cos(\theta), r \sin(\theta))$. In Figure 1 we set $t_3 = 2.5$, $t_5 = 3.5$, $\sigma_3^2 = 3$ and $\sigma_5^2 = 1.876$.
Consider an environment with a three spiral and five spiral task (Figure 1). In this environment, axis-aligned splits are inefficient, because the optimal partitions are better approximated by irregular polytopes than by the orthotopes provided by axis-aligned splits. The three spiral data helps the five spiral performance because the optimal partitioning for these two tasks is relatively similar to one another, as indicated by positive forward transfer. This is despite the fact that the five spiral task requires more fine partitioning than the three spiral task. Because Odif grows relatively deep trees, it over-partitions space, thereby rendering tasks with more coarse optimal decision boundaries useful for tasks with more fine optimal decision boundaries. The five spiral data also improves the three spiral performance.
Appendix F. Real Data Extended Results.

F.1 Spoken Digit Experiment
In this experiment, we used the spoken digit dataset provided in https://github.com/Jakobovski/free-spoken-digit-dataset. The dataset contains audio recordings from 6 different speakers with 50 recordings for each digit per speaker (3000 recordings in total). The experiment was set up with 6 tasks where each task contains recordings from only one speaker. For each recording, a spectrogram was extracted using Hanning windows of duration 16 ms with an overlap of 4 ms between the adjacent windows. The spectrograms were resized down to 28 × 28. The extracted spectrograms from 8 random recordings of '5' for 6 speakers are shown in Figure 2. For each Monte Carlo repetition of the experiment, spectrograms extracted for each task were randomly divided into a 55% train and 45% test set. As shown in Figure 3, both Odif and Odin show positive transfer between the spoken digit tasks, in contrast to other methods, some of which show only forward transfer, some only backward transfer, and some neither; none show both.
Table 3 shows the image classes associated with each task number. Supplementary Figure 4 is the same as Figure 3 but with 5,000 training samples per task, rather than 500. Notably, with 5,000 samples, replay methods are able to transfer both forward and backward as well. However, note that although total replay outperforms both Odif and Odin with large sample sizes, it is not a bona fide lifelong learning algorithm, because it requires $n^2$ time. Moreover, the replay methods will eventually forget as more tasks are introduced, because they will run out of capacity.
Figure 5 shows the same result as the label shuffling from Figure 4, but with 5,000 samples per class. The results for Odin and Odif are qualitatively similar, in that they transfer backward. The replay methods are also able to transfer when using this larger number of samples, although with considerably higher computational cost.

F.4 CIFAR 10x10 Repeated Classes
We also considered the setting where each task is defined by a random sampling of 10 out of 100 classes with replacement. This environment is designed to demonstrate the effect of tasks with shared subtasks, which is a common property of real world lifelong learning tasks. Supplementary Figure 6 shows transfer efficiency of Odif and Odin on Task 1. Odif still demonstrates positive forward, backward, and final transfer, unlike most of the state-of-the-art algorithms, which demonstrate forgetting. The replay methods, however, do demonstrate transfer, albeit with significantly higher computational cost.
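The task construction can be sketched as below, under one reading of the sampling scheme: each task draws 10 distinct classes out of 100, and draws are independent across tasks, so a class may recur in several tasks. That reading, and the names `sample_tasks` and `recurring_classes`, are assumptions.

```python
import random
from collections import Counter


def sample_tasks(num_tasks, num_classes=100, classes_per_task=10, seed=0):
    """Draw each task's classes independently, so classes can repeat across tasks."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(num_classes), classes_per_task))
            for _ in range(num_tasks)]


def recurring_classes(tasks):
    """Classes that appear in more than one task (the shared subtasks)."""
    counts = Counter(c for task in tasks for c in task)
    return sorted(c for c, k in counts.items() if k > 1)


if __name__ == "__main__":
    tasks = sample_tasks(10)
    print(tasks[0])
    print(len(recurring_classes(tasks)))
```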