Deep Learning Without Neural Networks: Fractal-nets for Rare Event Modeling

Complex phenomena of societal interest such as weather, seismic activity and urban crime are often punctuated by rare and extreme events 1, which are difficult to model and predict. Evidence of long-range persistence 2,3 of such events has underscored the need to learn deep stochastic structures in data for effective forecasts. Recently, neural networks (NN) have emerged as a de facto standard for deep learning 4-9. However, key problems remain with NN inference, including a high sample complexity, a general lack of transparency, and a limited ability to directly model stochastic phenomena. In this study we suggest that deep learning and the NN paradigm are conceptually distinct, and that it is possible to learn "deep" associations without invoking the ubiquitous NN strategy of global optimization via back-propagation 10. We show that deep learning of stochastic phenomena is related to uncovering the emergent self-similarities in data, which avoids the NN pitfalls while offering crucial insights into underlying mechanisms. Using the Fractal Net (FN) architecture introduced here, we actionably forecast various categories of rare weather and seismic events, and property and violent crimes in major US cities. Compared to carefully tuned NNs, we boost recall at 90% precision by 161.9% for extreme weather events, 191.3% for light-to-severe seismic events with magnitudes above the local third quartile, and 50.8%-418.6% for urban crime, demonstrating applicability in diverse systems of societal interest. This study opens the door to precise prediction of rare events in spatio-temporal phenomena, adding a new tool to the data science revolution.

Neural Networks (NN) are now a de facto standard for deep learning 4-9,20. The depth of the NNs used to discover prediction-relevant features suggested and sustained the "deep" metaphor. However, for complex systems in the physical and social sciences, often the key missing piece is effective modeling of stochastic processes, and the fantastic performance of NNs in learning deterministic input-output functions 21-23 carries over imperfectly, and with wasted expressive capacity, to stochastic phenomena. In the context of low-frequency seismic events and extreme weather and crime, we care about precise event localization, underlining the importance of explicit stochastic modeling.
To address this gap we prompt the reader to think of deep learning and the NN paradigm as conceptually distinct: we show that it is possible to learn "deep" associations without invoking the ubiquitous NN strategy of global optimization via back-propagation of the loss gradient 10. Our key insight is tied to self-similar structures arising in ergodic stochastic processes which take values over a finite alphabet. If such a process has a well-defined set of dynamical states, self-similarity arises naturally: once a trajectory visits a particular state, we can "forget" what happened before, implying that future trajectories from each distinct visit of the same state are statistically indistinguishable. Since an ergodic system must revisit its states, the observed dynamical behavior must have self-similar structures, with the set of future paths from any state essentially a scaled copy of the set of futures from any subsequent visit of the same state. Fig. 2a demonstrates this idea for simple processes with a binary alphabet; we map current states to the unit interval, associating the transpired history with the binary representation of points on [0,1], revealing the process-specific self-similar organization.

Uncovering hidden dynamical states is thus equivalent to recovering fractal structures in observed data, using what we refer to as self-similar compression (SSC), as illustrated in Fig. 2b.

Fig. 2. Fractal Structures in Stochastic Processes and Self-Similar Compression. Panel a: We map the process states to the unit interval, associating the history of observations from the stationary distribution with the binary representation of the points in [0,1]. Stacking these representations as the length of the observation increases reveals the process-specific fractal structures shown in (i)-(iv) of panel a. As an example in (i), because every sequence with last symbol 0 leads to the state q0, we can interpret U_k as the subset of the unit interval where the k-th digit in the binary representation is 0. The block highlighted in red then maps from the sub-tree under the node reached by 10 (referred to as A_10) in the binary tree shown in panel b(ii). Also note that the highlighted block in panel a(i) is a scaled copy of the full representation. Panel b: The input data stream induces an m-ary tree, where m is the alphabet size, with probabilities attached to each branch at each split. SSC recursively identifies subtrees that are sufficiently similar to obtain a finite generative model. For example, for the process in panel a(i), the probability of a 0 under the stationary distribution differs from that after we observe a 0, which in turn differs from that after we observe a 1, producing 2 new states in the first step. But the probability of any string after we observe 00 is the same as after observing a single 0, implying a loop labeled 0 at the state reached by the first 0. Continuing in this fashion, we distill the complete transition structure in six steps, as shown in panel b(iii).
Table 1 footnotes. A: AUC outperformance is measured as the percentage increase in the AUC averaged over the spatial tiles. †B: Sensitivity outperformance is measured at 90% positive predictive value (PPV).

In our approach to learning this emergent self-similar structure, we employ no "neurons", no fixed activation functions, no user-specified loss functions, and no global optimization via back-propagation. Instead, SSC distills local models (See Fig. 3a), which are assembled into a predictive network (See Fig. 3c) that we call the Fractal Net (FN). The FN is archetypically distinct from the familiar NN architecture (contrast Fig. 3b with Fig. 3c). Each of our elemental units (SSC models) is a probabilistic finite state automaton (PFSA) or a generalized probabilistic automaton for cross-dependencies (crossed PFSA, or XPFSA). Thus, given a source stream s and a target stream r, we infer models H^s_{r,Δ} (See Fig. 3a), which make predictions Δ steps into the future, in the stream r, after making observations in the stream s. These models have a finite set of states, with the number of states, the state transition map and the output event probabilities all inferred from data without prior constraints.
Our models are discrete; they learn from, and predict, categorical input streams. This is well suited to modeling rare event dynamics, where we encode an event as a 1 and its absence as a 0. In applications with continuous event magnitudes, such as seismic modeling, we use quantization to identify events of interest, e.g., treating all events with magnitude above the local third quartile as a 1.
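As a concrete illustration, a minimal sketch of this quantization step might read as follows (not the authors' implementation; the function name and the trailing-window convention are our assumptions):

```python
import numpy as np

def binarize_magnitudes(magnitudes, lookback):
    """Encode a continuous magnitude series as categorical events:
    1 if the value exceeds the local third quartile computed over a
    trailing window of `lookback` past observations, else 0."""
    out = np.zeros(len(magnitudes), dtype=int)
    for t in range(lookback, len(magnitudes)):
        q3 = np.quantile(magnitudes[t - lookback:t], 0.75)  # local third quartile
        out[t] = int(magnitudes[t] > q3)
    return out
```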
Instead of the number of layers of neurons, depth in FNs is reflected in the average number of states in the inferred SSC models. Instead of assuming fixed memoryless non-linear activation functions (such as tanh, the rectified linear unit 10, etc.), here we infer the local activation structure from data. The individual SSC models dictate link activation as a function of their current state, which is in turn determined by the event history. The number of model states reflects the temporal depth that may be needed to determine the link state. This results in deeper models with a significantly smaller parameter set (See Extended Data Tab. 2).
Notably, as illustrated in Fig. 3c, each source-target link can have multiple SSC models operating at disparate time scales. Thus, in contrast to layer stacking in Long Short-term Memory (LSTM) NNs, we model multiple timescales explicitly (See Methods: Step Two and Extended Data Algorithm 1, lines 11-22). As an added bonus, this approach also side-steps the problem of vanishing and exploding gradients. Back-propagation in NNs proceeds by updating network parameters as a function of the iterated loss gradient. Often this gradient decays too fast for effective learning 24, and despite recent architectural modifications the problem persists 24,25. In contrast, SSC inference does not propagate any gradients; it identifies a distinct model H^s_{r,Δ} for each value of Δ. We compare the decay behavior of the two frameworks in Fig. 3d, where 1D systems ranging from linear (subpanel i) to non-linear chaotic regimes (subpanels ii-v) are analyzed. We find that the loss gradient vanishes exponentially fast for NNs (column 4) irrespective of the system, whereas for FNs the corresponding measure, the coefficient of causality (column 3), is analogous to the respective autocorrelation function of x (column 2).
As a final point of architectural contrast, we switch the order of the linear and non-linear operators: the non-linear link activations are combined linearly and passed on as inputs to downstream nodes (See Fig. 3d). That is, in FNs we have a linear combination of inferred non-linear link activations, whereas in NNs we have a fixed non-linear activation of a linear combination of nodal inputs. The local weights are easily computed with standard regressors, allowing local changes to be integrated on-the-fly (See Methods: Step One).
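The contrast can be made concrete with a toy sketch (illustrative only; the state-lookup interface is our assumption, not the FN API):

```python
import numpy as np

def nn_node(inputs, weights):
    """NN convention: fixed non-linear activation of a linear combination."""
    return np.tanh(np.dot(weights, inputs))

def fn_node(link_states, link_event_probs, weights):
    """FN convention: linear combination of inferred non-linear link
    activations. Each link's activation is the event probability of its
    current SSC-model state, itself determined by the event history."""
    activations = np.array([link_event_probs[i][s]
                            for i, s in enumerate(link_states)])
    return np.dot(weights, activations)
```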
To briefly describe how SSC actually infers PFSAs or their crossed versions: the input data stream induces an m-ary tree, where m is the alphabet size, with probabilities attached to each branch at each split. SSC recursively identifies subtrees that are sufficiently similar to obtain a finite generative model of the ergodic stationary process (See Fig. 2b).
Finally, precise results on sample complexity for NNs are unknown. In contrast, we establish explicit results on sample complexity, and show that we obtain good models with high probability (See Methods: Performance Analysis and Supplementary Text Sec. VIII). More precisely, let ε, ε₀ > 0 be arbitrarily small positive numbers; with input length polynomial in 1/ε and logarithmic in 1/ε₀, and for a reasonably separated family of generating models, the probability that the deviation between the inferred and true generating models exceeds ε, as measured by the KL divergence h_KL, is upper-bounded by ε₀. That is, an input length n = O(poly(1/ε, log(1/ε₀))) suffices. A schematic map of how the mathematical development leads to these performance bounds is shown in Extended Data Fig. 3.
To demonstrate FN applicability to diverse spatio-temporal phenomena, we consider 1) rare weather events in the contiguous US, 2) global seismic events, and 3-5) urban crime in Atlanta GA, Chicago IL, and Philadelphia PA. These applications, enumerated in Tab. 1, highlight our strictly superior performance over carefully tuned LSTMs. In each case, we begin with a spatio-temporal log enumerating events of interest along with their space-time coordinates. For example, for US weather, we use events logged by the Automated Surface Observing Systems (ASOS) network recording extreme precipitation and cold/snow events. The seismic event log is curated by the United States Geological Survey (USGS) hazards program, where we forecast events within 3 × 3 tiles with a magnitude greater than the local third quartile of all events within the past decade.
For forecasting urban crime, we log daily occurrences of property crimes (burglary, theft, etc.) and violent crimes (homicide, assault, battery, etc.) within a couple of city blocks. We tune the discrete time-steps automatically to maximize the average entropy rate of the data streams, resulting in steps of 12 hours for the weather models, 3 days for seismic prediction, and 1-2 days for urban crime. With the exception of the precipitation event in weather modeling, all event frequencies are lower than 10%. We also choose how far into the future we make predictions (the prediction horizon): 1 week for weather, 4 months for seismic prediction, and 3-7 days for urban crime (See Tab. 1 for details); performance modestly improves for shorter horizons and degrades rapidly for longer ones. All these predictions, except perhaps in the case of the seismic events, are actionable, i.e., they are made precisely and early enough to intervene or mitigate. For the seismic case, we target events with magnitude greater than the third quartile of the magnitudes of all local recorded events in the past decade (See Extended Data Fig. 1a; mean: 4.7). Thus, we do not discriminate between light (4-4.9) and more severe events, to maintain statistical power. Nevertheless, we correctly predict 13 out of the largest 15 events in our out-of-sample period (August 2019 - August 2020) 120 ± 3 days in advance (See Extended Data Fig. 1c-d). In all these problems, we significantly outperform carefully tuned LSTMs. As shown in Table 1, we boost sensitivity at 90% positive predictive value by 161.9% for extreme weather events, 191.3% for seismic activity over the local third quartile, and 150.0%, 418.6%, and 50.8% for criminal infractions in Chicago, Philadelphia and Atlanta respectively. Outperformance measured by the increase in the area under the receiver operating characteristic curve for the corresponding problems is 10.0%, 9.2%, 17.7%, 10.0%, and 12.3% respectively.
Beyond predictive performance, FNs provide insight into dynamical properties, which is generally difficult with NNs. Each SSC model H^s_{r,Δ} has a coefficient of causality γ^s_r(Δ), intuitively defined as

$$\gamma^s_{r}(\Delta) = 1 - \frac{\text{uncertainty of the output } \Delta \text{ steps in the future in } r \text{ given observation of the past of } s}{\text{average uncertainty of the output } \Delta \text{ steps in the future in } r},$$

specifying how much information about the Δ-step future of the target stream r is obtained per unit bit acquired about the source stream s. As shown in Fig. 3d, for 1D systems γ^s_s is analogous to autocorrelation: this is expected, since higher autocorrelation implies higher predictability. In general, the average Δ-dependence of γ^s_s provides simple sanity checks. For example, the physical nature of weather and seismic systems necessitates an influence decay as we move away in space and time from the event epicenters; no "teleportation" of influence should be possible. Thus, in the neighborhood of events we expect that, on average, γ^s_s must decay with increasing Δ, and γ^s_r should decay as the physical distance between source and target increases. Fig. 4, rows 2 and 3, illustrates that these patterns are correctly recovered for the weather and seismic systems. Interestingly, while the temporal decay also holds for urban crime, we find no such decay in the spatial dimension for these social systems. This discrepancy possibly suggests that urban spaces operate as a single unit with long-ranged coherence, not unlike the self-organized criticality suspected to emerge in flocks of birds 26 and other physical systems near criticality.

[Fig. 4 caption fragment:] ... illustrates the variation of γ with distance. As required in physical systems (columns 1 and 2) we find a rapid decay; this is absent in the social systems (columns 3-5), suggesting long-range persistence in organization akin to critical phenomena.
In addition, γ^s_s(Δ) also sheds light on event frequency after a rare event. Our models have specific states with large event likelihoods; as we move away from these states, the event likelihood decays. In the context of earthquakes, the increased event frequency (aftershocks) following an event is known to decay rapidly with time according to the empirical Omori-Utsu law 27. In our framework, the coefficient of causality can be expressed in terms of entropies as

$$\gamma^s_r(\Delta) = 1 - \frac{\sum_i \varpi_i\, h(p_i)}{h(p_o)},$$

where h(·) is the binary entropy function (h(x) = -x log(x) - (1-x) log(1-x)), ϖ_i is the stationary probability of the i-th inferred state in stream s, p_i is the corresponding event probability in stream r, and p_o is the average event probability. Considering "self-models", i.e., where s and r are the same, we have the bound

$$\gamma^s_s(\Delta) \;\geq\; \alpha(\Delta) + \big(1 - h(p_E(\Delta))\big), \tag{4}$$

where the subscript E refers to the state(s) with the maximum event probability, α(Δ) collects the contribution of the remaining states, and we use the fact that ∀i, ϖ_i ∈ [0,1] with Σ_i ϖ_i = 1 and h(p_o) ≤ 1. Under the idealized assumption that events are unlikely from states other than E, α(Δ) is a weak function of Δ, implying that the inferred γ-vs-Δ plots indicate how the event frequency varies in the aftermath. Fig. 4 illustrates that the observed behaviors may be approximately reconstructed from variations in the aftermath event frequency. None of the above insights are possible within the NN framework, in which the gradient decay is purely an artifact of the optimization algorithm and does not reflect system properties.
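A minimal sketch of this computation, assuming the entropy-ratio form of γ given above (function names are ours):

```python
import numpy as np

def binary_entropy(p):
    """h(p) = -p log2(p) - (1-p) log2(1-p), with h(0) = h(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def coefficient_of_causality(stationary, event_probs, avg_event_prob):
    """gamma = 1 - sum_i w_i h(p_i) / h(p_o): the fraction of the target's
    predictive uncertainty removed by knowing the source state."""
    conditional = np.dot(stationary, binary_entropy(np.asarray(event_probs)))
    return 1.0 - conditional / binary_entropy(avg_event_prob)
```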
A key limitation of FNs is the need for categorical data in self-similar compression. In systems with continuous-valued observations, we can set a magnitude threshold that effectively defines the events of interest (as demonstrated in seismic modeling). However, more complex event definitions might be warranted elsewhere. Future research will investigate these issues, and attempt to address event frequencies significantly lower than those demonstrated here.
Thus, in this study we have laid the groundwork to broaden the applicability of data-driven analytics to rare event modeling in complex systems. We hope that this technology, integrated with existing tools, will push the boundaries of our current limits on predictive mitigation of natural disasters and catastrophic societal events.

MATERIALS AND METHODS
Fractal Net is assembled from local SSC models which are, in general, crossed probabilistic automata (XPFSA). The theoretical development supporting the claims in the main text is presented in the Supplementary Text. A map of how the mathematical proofs relate to each other is presented in Extended Data Fig. 3, which shows that the key concepts (causal states, persistent causal states, synchronization, and accumulation measures) all come together to establish the correctness of the inference algorithm(s). The rest of this section describes the Fractal Net construction.
The construction of a Fractal Net consists of two steps: 1) local SSC model generation and network pruning, and 2) local model aggregation for comprehensive prediction. As discussed in the main text, these local models determine link activation based on their current state, and event prediction is accomplished by aggregating these local activations via a local regressor. No global optimization of this aggregation function is necessary.
The model generation step of Fractal Net is accomplished by the algorithms GenESeSS (See Extended Data Algorithm 2) and xGenESeSS (See Extended Data Algorithm 3). GenESeSS and xGenESeSS are implementations of the self-similar compression (SSC) discussed in the main text. GenESeSS yields PFSA models that capture how the history of an input process influences its own future, and xGenESeSS produces XPFSA models that capture how the history of a source process influences the future of a target process. The Fractal Net construction is described in Extended Data Algorithm 1, and takes as input a set {x^s : s ∈ S} of length-n time series, hyperparameters ε and n₀ < n for local model inference, Δ_max for the maximum time delay, and γ₀ for thresholding admissible models. For each target sequence x^r, Fractal Net outputs a set of admissible models M_r, with a scalar weight for each model in M_r, via model inference and pruning (lines 1-10) and training of the aggregation weights (lines 11-22).
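The overall construction can be summarized by the following structural sketch (a hypothetical API; GenESeSS/xGenESeSS and the coefficient-of-causality routine are stand-in callables, not the released implementation):

```python
def fractal_net(X, targets, infer_xpfsa, gamma, eps, n0, delta_max, gamma0):
    """Structural sketch of the Fractal Net construction (Step 1).

    X           : dict mapping stream id -> categorical sequence
    infer_xpfsa : callable standing in for xGenESeSS (SSC inference)
    gamma       : callable returning a model's coefficient of causality
    Returns, for each target r, its admissible model set M_r; the scalar
    aggregation weights are trained afterwards in Step 2.
    """
    models = {}
    for r in targets:
        admissible = {}
        for s in X:                            # every candidate source stream
            for d in range(1, delta_max + 1):  # every time delay up to delta_max
                m = infer_xpfsa(X[s][:n0], X[r][:n0], eps, delay=d)
                if gamma(m) > gamma0:          # prune weak links early
                    admissible[(s, d)] = m
        models[r] = admissible
    return models
```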

Step 1: Model inference and pruning
The Fractal Net framework models the influence of a source time series x^s on a target time series x^r at a particular time delay Δ by an XPFSA H^s_{r,Δ} (line 7). Thus, we infer |S|·Δ_max XPFSA models for each x^r, which yields |S|²·Δ_max models in total. Since the number of XPFSA models increases quadratically with the number of time series, and the strength of the links may vary, pruning low-performing models early is important for parsimony. The admissible set is obtained by thresholding on the coefficient of causality,

$$M_r = \left\{ H^s_{r,\Delta} : \gamma^s_r(\Delta) > \gamma_0 \right\},$$

for each r ∈ S.

Step 2: Train linear weights
In this step, we integrate the local models in x^r's admissible set to forecast events in x^r. The aggregate prediction at time h is a linear combination of the local model outputs,

$$\hat{x}^r_h = \sum_{(s,\Delta)} w_{s,\Delta}\, H^s_{r,\Delta}\big((x^s)|_{h-\Delta}\big),$$

where (x^s)|_{h-Δ} is the truncation of x^s at h - Δ. To compute the coefficients, we solve a regression problem Reg(X, y) (line 22) for each r ∈ S, with the predictor variables being the predictions x̂^t_{s,Δ} obtained by running each sequence (x^s)|_{n₀+t-Δ} through H^s_{r,Δ} (line 17), and the outcome variable being x^r_{n₀+t}, the value of x^r at time n₀ + t (line 18). Hence, X is the (n - n₀) × |M_r| matrix with the entry indexed by (t, (s,Δ)) given by x̂^t_{s,Δ}, and y is the (n - n₀)-dimensional vector with the entry indexed by t given by x^r_{n₀+t}. We can solve for the linear weights with any standard regressor.
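A minimal sketch of this regression step, assuming the local model predictions have already been computed (sklearn is our choice here; any standard regressor works, as noted above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_link_weights(x_r, predictions, n0):
    """Assemble and solve the regression problem Reg(X, y) of Step 2.

    x_r         : target event sequence of length n
    predictions : dict (s, delta) -> array whose entry t is the output of
                  H^s_{r,delta} for time n0 + t
    """
    n = len(x_r)
    keys = sorted(predictions)                       # fix a column order
    X = np.column_stack([predictions[k][:n - n0] for k in keys])
    y = np.asarray(x_r[n0:])                         # outcomes x^r_{n0+t}
    reg = LinearRegression().fit(X, y)               # any standard regressor works
    return dict(zip(keys, reg.coef_)), reg.intercept_
```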

Performance Analysis of GenESeSS
On lines 6 and 7 of Extended Data Algorithm 1, Fractal Net calls the subroutines GenESeSS and xGenESeSS. The two algorithms are conceptually similar: while the first infers PFSA as generators of stochastic processes, the second infers XPFSA as models of cross-dependencies between processes. Here, we establish the correctness of GenESeSS.
The inference algorithm for PFSA is called GenESeSS for Generator Extraction Using Self-similar Semantics. Both the derivation of the PFSA model and its SSC inference are based on the concept of causal state. A dynamical system reaches the same causal state via distinct paths if the futures are statistically indistinguishable.
More precisely, each process over an alphabet Σ of size m gives rise naturally to an m-ary tree, with the nodes at level d being sequences of length d, and the edge from the node x to xσ, σ ∈ Σ, labeled by Pr(σ|x), the probability of observing σ as the next output after x. By the definition of causal state, if two subtrees are identical with respect to edge labels, then their roots are sequences that lead the system to the same causal state. We show in Supplementary Text Secs. II and III that, for a process of Markov order k, identifying all the roots of identical subtrees indeed yields a finite automaton structure whose unique strongly connected component is the generating PFSA of the process. The automaton structure obtained from this subtree "stitching" procedure is formalized in Defn. 1.
Definition 1 (Probabilistic Finite-State Automaton (PFSA)). A PFSA G is a quadruple (Q, Σ, δ, π̃), where Q is a finite set of states, Σ is a finite alphabet, δ: Q × Σ → Q is called the transition map, and π̃: Q → P_Σ, where P_Σ is the space of probability distributions over Σ, is called the transition probability (Supplementary Text Sec. II).

Step 2 of Extended Data Algorithm 2 (lines 5-19) is an implementation of this subtree "stitching" approach under finiteness of input data. Note that the criterion for "stitching" two subtrees with roots x and x′ is that their edge labels are identical at all depths, which translates to p(y|x) = p(y|x′) for sequences y of all lengths. This criterion is not verifiable with finite data, and hence GenESeSS identifies two subtrees if they agree at depth one. Defining the symbolic derivative φ_x to be the vector with the entry indexed by σ given by p(σ|x), GenESeSS identifies x and x′ if φ_x = φ_{x′}. This approach works well under the assumption that the target PFSA is in general position, meaning that different causal states have distinct symbolic derivatives. In practice, GenESeSS uses the empirical symbolic derivative to approximate φ_x: for an input sequence x of finite length, the empirical symbolic derivative φ̂^x_y of a sub-sequence y of x is the probability vector with the entry indexed by σ given by

$$\hat\phi^x_y(\sigma) = \frac{\#\{\text{occurrences of } y\sigma \text{ in } x\}}{\#\{\text{occurrences of } y \text{ in } x\}}.$$

By the general position assumption, and assuming x is long enough, with high probability no φ̂^x_y is within an ε-neighborhood of φ̂^x_{y′} for histories leading to distinct causal states, and hence each is recorded as the identifier of a new state. In fact, GenESeSS keeps appending symbols to the identifiers of stored states and adding new states until it reaches sequences of length k + 1. For y = σ₁⋯σ_kσ_{k+1}, since the process is of order k, we have φ_y = φ_z for z = σ₂⋯σ_{k+1}, and hence, with high probability, φ̂^x_y and φ̂^x_z fall within an ε-neighborhood of each other given a long enough input x. In this case, GenESeSS identifies the state represented by y with that of z. Indeed, GenESeSS identifies all states represented by sequences of length k + 1 with previously stored states, and since no new states can be found, it exits the loop on line 8 after iteration k + 1.

For processes without a finite Markov order, however, Step 2 of GenESeSS would never exit in theory, since there exists no n ∈ ℕ such that every causal state is visited by sequences of length n. And if we implement an artificial exit criterion, the inferred model might be unnecessarily large, with hard-to-model approximations. We address this issue via the notion of synchronization: the ability to identify that we are localized, or synchronized, to a particular state despite being uncertain of the initial state.
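A minimal sketch of the empirical symbolic derivative and the depth-one "stitching" test (function names and the string-based encoding are our assumptions):

```python
import numpy as np

def empirical_symbolic_derivative(x, y, alphabet):
    """Empirical next-symbol distribution observed right after the
    sub-sequence y in the stream x (phi-hat in the text)."""
    counts = np.array([sum(1 for i in range(len(x) - len(y))
                           if x[i:i + len(y)] == y and x[i + len(y)] == a)
                       for a in alphabet], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def same_causal_state(x, y1, y2, alphabet, eps):
    """Depth-one stitching test: identify two tree nodes (histories) if
    their empirical symbolic derivatives fall in the same eps-ball (L1)."""
    d1 = empirical_symbolic_derivative(x, y1, alphabet)
    d2 = empirical_symbolic_derivative(x, y2, alphabet)
    return np.abs(d1 - d2).sum() < eps
```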
The solution for inferring succinct models in the non-finite Markov order case lies in the concept of synchronization. In Step 1 of Extended Data Algorithm 2 (lines 1-4), GenESeSS finds an almost synchronizing sequence, which allows it to distill a structure similar to that of the finite Markov order case, and thus carry out the subtree "stitching" procedure described above.
A sequence x is synchronizing if all sequences that end with the suffix x terminate at the same causal state. A process is synchronizable if it has a synchronizing sequence, and a PFSA is synchronizable if the process it generates is synchronizable. All processes of finite Markov order, and hence their generating PFSA, are synchronizable.
Example: An example of a synchronizable process without finite Markov order is given by the process generated by the PFSA in Extended Data Fig. 4c. The shortest synchronizing sequence of this PFSA is 11. The binary tree of the process generated by the PFSA is given in Panel g, with darker nodes representing synchronizing sequences (See Supplementary Text Example 4 for detail). We show in Supplementary Text Sec. IV that, although a synchronizable process generated by a PFSA may have an infinite set of causal states, only finitely many among them are persistent, i.e., repeated with non-vanishing probability as the sequence length increases. We also show in Supplementary Text Sec. III that any process having a finite set of persistent causal states, whose probabilities sum to 1 as the sequence length increases, has a PFSA generator whose state set Q is in one-to-one correspondence with the set of persistent causal states. In Extended Data Fig. 5a, we show the probability of sequences synchronizing to the three states of the PFSA in Extended Data Fig. 4c; the sum indeed approaches 1 as the sequence length increases. Hence, although the process has no finite Markov order, it has a PFSA generator.
Since any sequence prefixed by a synchronizing sequence is itself synchronizing, as long as Step 1 of GenESeSS produces a synchronizing sequence, Step 2 works solely with causal states represented by synchronizing sequences. The analysis above implies that, assuming a process has a PFSA generator, the PFSA obtained by subtree stitching on a subtree rooted at a synchronizing sequence is indeed the generating PFSA of the process. We show in Extended Data Fig. 6g the running tree of GenESeSS for the PFSA in Panel e. A running tree of GenESeSS rooted at state q visualizes the run of GenESeSS, given that the x₀ found in Step 1 is synchronizing to state q and that GenESeSS correctly identifies all new and repeating states. Nodes colored orange in the tree are the new states GenESeSS finds in its run, while the gray nodes are the repeating ones. Note that, in the running tree, if we let the gray nodes (repeating states) travel along the gray lines until they overlap with the orange nodes (stored states) with a matching label, we recover the PFSA in Panel e.
Finally, we describe the scenario of recovering a PFSA from a non-synchronizable process. An example of a non-synchronizable process is given in Extended Data Fig. 4d (See Supplementary Text Example 5 for a detailed description). The directed graph obtained by subtree stitching for sequences up to length 5 is shown in Panel h. Recall that GenESeSS can recover a synchronizable PFSA with the help of a synchronizing sequence because of the one-to-one correspondence between the set of persistent causal states and the state set of the PFSA. However, as we show in Supplementary Text Sec. III, Example 7, the set of persistent causal states of this PFSA is empty. In this specific example, the set of causal states has the form {q_n : n ∈ ℤ}, where ℤ is the set of integers. The probability of the causal state q_n at sequence length d, denoted p_d(q_n), can be calculated from the recursive formula (61); we plot p_d(q_n) vs n in Extended Data Fig. 5b for d = 100, 200, 300, 400.
It follows from Eq. (61), and also by direct observation of Extended Data Fig. 5b, that as the sequence length increases the curves flatten out, suggesting lim_{d→∞} p_d(q_n) = 0 for all n, and hence that the PFSA has no persistent states.
In order to obtain a finite generator, we consider the topological properties of the set of causal states, specifically the closure of the set (See Supplementary Text Sec. III). We show that if the closure of the set of causal states has finitely many atomic accumulation points, then the process is generated by a PFSA whose set of states is in one-to-one correspondence with the set of atomic accumulation points. In Extended Data Fig. 5c, we show the cumulative distribution function of φ_x(0) (the probability of producing 0 after observing x) for |x| = 20 and 40. The curve approaches a step function with two steps; in fact, we can show that the two steps correspond to q_{-∞} and q_{∞}, the two limit points of the set of causal states. See Supplementary Text Example 8 for detail.
From the analysis above, we see that the problem of recovering a finite structure of the target PFSA from a non-synchronizable process boils down to approximating the atomic accumulation points of the set of causal states, which motivates the concept of ε-synchronization 33.
A sequence x is "-synchronizing to the state q if the distribution } x on the state set Q induced by x satisfies k} x e q k I < ", where e q is the base vector with I on the entry indexed by q and H elsewhere. We show in Supplementary Text Sec. IV that there exists "-synchronizing sequence for arbitrarily small " > H. The importance of "-synchronizing sequence is twofold: 1) since T x a } T x e ¥, where e ¥ is the jQj ¢ j¦j matrix with the row indexed by q given by e @qA, a } x close to e q give rise to a x close to e @qA. And 2) although sequences prefixed by an "-synchronizing sequence to a state q may not remain "-synchronizing to state q, they are close to q on average. As an example, let us consider the non-synchronizable PFSA in Extended Data Fig. 6f over an alphabet of size Q. In Extended Data Fig. 6a and c, we show scatter plots of points @ x @HA; x @IAA and @ H S x @HA; H S x @IAA for jxj a Q and W. to be a good candidate for an almost synchronizing sequence.
Due to the finiteness of data, we cannot expect to distinguish causal states whose marginals on Σ are too close. Thus, to make our identifiability conditions explicit, we require the transition probability π̃ of the target PFSA to be ζ-distinguishable, i.e., ‖π̃(q) - π̃(q′)‖₁ > ζ for all q ≠ q′ ∈ Q. Under these conditions, we show in Supplementary Text Sec. VIII probably approximately correct (PAC) 34 learnability for the class of synchronizable ζ-separable PFSA with |Q| ≤ M and min{π̃(q, σ) : π̃(q, σ) > 0} ≥ η. In particular, assuming ε ≤ min{ζ/2, η} and that the length L of the segment used for finding a synchronizing sequence is of the order of log_{|Σ|}(1/ε), we have

$$\Pr\left[\,h_{KL}\!\left(G \,\|\, G'\right) > \epsilon\,\right] < \epsilon_0, \tag{9}$$

where G′ is a PFSA inferred from a sequence generated by G of minimum length n satisfying

$$n = O\!\left[\frac{1}{\epsilon^2}\,\log\!\left(\frac{1}{\epsilon_0}\right)\big(M + L(\epsilon)\big)\right]. \tag{10}$$

We can also show that, if G is a synchronizable PFSA and the sequence x₀ found in Step 1 is a synchronizing sequence, then a PFSA G′ inferred from a length-n sequence generated by G satisfies an analogous bound. See Extended Data Fig. 3 for a summary, as a directed graph, of the theoretical development discussed above guaranteeing algorithmic performance.

Performance Analysis of xGenESeSS
The inference algorithm for XPFSA is called xGenESeSS; it takes as input two sequences x_in and x_out and a hyperparameter ε, and outputs an XPFSA, in a manner very similar to the inference algorithm for PFSA (See Extended Data Algorithm 3 for detail).
While a PFSA models how the past of a time series influences its own future, an XPFSA models how the past of an input time series influences the future of an output time series. Hence, while in the SSC algorithm for PFSA we identify sequences if they lead to statistically indistinguishable futures, in the SSC algorithm for XPFSA we identify sequences if they lead to the same future distribution of the output.

Definition 2 (Crossed Probabilistic Finite-State Automaton (XPFSA)). A crossed probabilistic finite-state automaton is specified by a quintuple (Σ_in, R, δ, Σ_out, χ̃), where Σ_in is a finite input alphabet, R is a finite state set, δ is a partial function from R × Σ_in to R called the transition map, Σ_out is a finite output alphabet, and χ̃ is a function from R to P_{Σ_out}, the space of probability distributions over Σ_out, called the output probability map. In particular, χ̃(r, σ) is the probability of generating σ ∈ Σ_out from a state r ∈ R (See Supplementary Text Sec. VII).
Extended Data Fig. 4i gives an example of an XPFSA with 4 states. Note that an XPFSA has no transition probabilities defined between states as a PFSA does. The XPFSA in the example has a binary input alphabet and an output alphabet of size 3. The bar charts next to the 4 states of the XPFSA indicate the output probability distributions. To generate a sample path, an XPFSA requires an input sequence over its input alphabet.
Similar to the PFSA construction approach, here we compute the cross symbolic derivative, which is the ordered tuple Pr(σ|x), with σ ∈ Σ_out and x a sequence over Σ_in. We compute the empirical approximation of the cross symbolic derivative from the sequences x_in and x_out as

$$\hat\phi^{x_{in},x_{out}}_y(\sigma) = \frac{\#\{\text{occurrences of } \sigma \text{ in } x_{out} \text{ right after } y \text{ transpires in } x_{in}\}}{\#\{\text{occurrences of the sub-sequence } y \text{ in } x_{in}\}}. \tag{12}$$

Thus, xGenESeSS is almost identical to GenESeSS except that, in Step 1, it finds an almost synchronizing sequence based on cross symbolic derivatives, and in Step 2, it identifies the transition structure based on the similarity between cross symbolic derivatives. The arguments establishing the effectiveness of GenESeSS carry over to xGenESeSS with the empirical symbolic derivative replaced by the empirical cross symbolic derivative (See Supplementary Text Defn. 44).
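A minimal sketch of Eq. (12) (string-based encoding assumed; positions past the end of x_out are ignored, a simplification of the denominator):

```python
import numpy as np

def empirical_cross_derivative(x_in, x_out, y, out_alphabet):
    """Empirical cross symbolic derivative of Eq. (12): the distribution of
    the next output symbol conditioned on y having just transpired in the
    input stream."""
    hits = [i + len(y) for i in range(len(x_in) - len(y))
            if x_in[i:i + len(y)] == y]                 # positions right after y
    counts = np.array([sum(1 for t in hits
                           if t < len(x_out) and x_out[t] == a)
                       for a in out_alphabet], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts
```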

Loss Function As Generalized KL Divergence Between Stochastic Processes
Deviation between deterministic functions may be measured in diverse metrics, leading to the necessity of a user-defined choice of loss function in NNs. In contrast, the deviation between stochastic processes needs to be measured by some quantification of the deviation of the associated finite dimensional distributions (FDD). Thus, any notion of loss we use must have the property that zero loss indicates convergent FDDs. We quantify this loss by defining a notion of KL divergence between two PFSA, extending the well-known information-theoretic measure of deviation between probability distributions to stochastic processes.
We show that the log-likelihood of a sequence being generated by a second PFSA converges to the sum of the entropy rate of the generating PFSA and the KL divergence of the second process from the actual generator (See Extended Data Fig. 5d-g). Performance of GenESeSS is then measured in terms of the KL divergence between the actual and inferred processes (See Methods: Performance Analysis of GenESeSS and Supplementary Text Sec. VIII). It is still possible to make different choices of loss metric by defining functions of the KL divergence between processes; however, they all produce qualitatively similar results.
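A minimal sketch of this likelihood-based estimate (the scoring callable loglik_model and a known entropy rate of the true generator are assumptions for illustration):

```python
def kl_from_log_likelihood(x, loglik_model, entropy_rate_true):
    """Estimate h_KL(G || G') via the convergence result above: the
    per-symbol negative log-likelihood of a G-generated sequence x under
    G' tends to H(G) + h_KL(G || G'); subtracting the entropy rate of the
    true generator leaves the divergence."""
    nll_rate = -loglik_model(x) / len(x)
    return nll_rate - entropy_rate_true
```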

APPLICATION DETAILS: DATA SOURCE, PREPROCESSING & PERFORMANCE COMPARISON
We demonstrate five applications of FN modeling in complex spatio-temporal phenomena: predicting rare weather events in the contiguous US, global seismic events with magnitude registering above the local third quartile, and urban crime in Atlanta GA, Chicago IL, and Philadelphia PA. These applications, enumerated in Tab. 1, 1) highlight the strictly superior performance of FN over LSTMs, 2) underline the parsimony of the FN models, and 3) provide important insights into the underlying "physics" of the system at hand.
In each of these systems, we begin from a spatio-temporal event log, which enumerates events of interest along with their space-time coordinates. For example, for US weather, the log is a culmination of events recorded at one of the 2000 airport-based weather stations of the ASOS network, logging extreme precipitation and cold/snow events. The seismic event log comes from the USGS hazards program, where we attempt to forecast events with a magnitude greater than the local third quartile of all events recorded within the past decade. For urban crime, we take a similar approach, logging property crimes (burglary, theft, etc.) and violent crimes (homicide, assault, battery, etc.) within a couple of city blocks. In each application, we need to choose a spatial discretization to define our tiles; for each event type, each tile defines a distinct variable that we model and make predictions on. For example, in the weather prediction problem, we have 2000 spatial tiles and three event categories (precipitation, cold/snow, and severity magnitude ≥ 3 35), resulting in 6000 variables. We eliminate tiles with event frequencies under 1%, leaving us with 5317 variables, or time series sequences. In the case of seismic events, we discretize the globe into 3 × 3 tiles and consider events with magnitude greater than the average event magnitude, where the average is taken over all events recorded in the past decade above magnitude 3.9. As before, we eliminate tiles that are not seismically active, leaving us with 470 variables or sequences. For urban crime, we follow a similar approach, covering the city with a grid spanning a couple of city blocks and eliminating tiles that lack a sufficient number of events.
This leads to 6165 variables in Chicago, 510 sequences in Atlanta, and 1037 sequences in Philadelphia. We also need to specify a temporal quantization, which determines how one discrete step maps to continuous time intervals. The temporal quantization is tuned programmatically to maximize the average entropy rate of the event streams, resulting in 12-hour steps for the weather data, 3 days for seismic prediction, and 1 or 2 days for urban crime. With the exception of the precipitation event in the weather modeling problem, the event frequencies are all lower than 10%. We also choose how far into the future we make predictions (the prediction horizon): 1 week for weather, 4 months for seismic prediction, and 3-7 days for urban crime (See Extended Data Tab. 1 for details).
The predictive ability of the FN framework is benchmarked against carefully tuned LSTM models 36, implemented with mean squared error as the loss and Adam as the optimizer. Using cross-entropy (ce) as the loss function instead of mean squared error did not produce a significant difference in performance. We train the LSTMs with the same data used for the FNs, using binarized input sequences with 0 indicating the absence of events and 1 the events of interest (See Methods: LSTM Comparison for details). We compare the achieved performance (See Tab. 1) of the FN and LSTM architectures using the distribution of the area under the receiver operating characteristic curve (AUC) over the spatial tiles in each application, and the precision-recall curves (PRC).
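For concreteness, a minimal benchmark model in the spirit described here might be built as follows (the exact architecture and windowing are our assumptions; hyperparameters follow the Extended Data caption):

```python
import tensorflow as tf

def build_lstm_benchmark(window):
    """A single-layer LSTM benchmark on binarized event windows."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, 1)),
        tf.keras.layers.LSTM(50),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mse")  # cross-entropy performed similarly
    return model

# model.fit(X, y, epochs=100, batch_size=32) on sliding windows of the
# same binarized sequences used for the FN models
```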

Performance Metrics
We use a flexible approach in evaluating the AUC for both Fractal Net and LSTM: a positive prediction is treated as correct if at least one event is recorded within ±1 time steps in the target spatial tile. We also account for the spatial variability of the exact event location in the evaluation of the PRC by replacing predicted events with 2D Gaussian densities, followed by the choice of a decision threshold.
More specifically, for a fixed time step t, we construct a risk map by summing 2D Gaussian densities centered at the tiles with a positive prediction. A threshold z is then chosen so that tiles with a risk level above z are reported as having a positive prediction. The standard deviation and the z-levels are chosen in the same manner that a threshold is selected on a ROC curve.

[Extended Data figure caption fragment (LSTM comparison):] We train LSTM models 41,42 using the tensorFlow.keras package 39 with 50 hidden units, sigmoid activation, mean squared error (MSE) as the loss function, and 100 training epochs with batch size 32. Given our loss function, the LSTM model should learn to output the probability of producing 1 as the next output symbol. In the bottom row, we show scatter plots of the probability of 1-predictions of the LSTM models vs input length. We color-code the dots in the scatter plots for the two processes of finite Markov order (Panels a and b) and the one generated by the three-state model (Panel c): if we know the current input leads to a certain state, we color the next output with the color of that state. The process in Panel d has no finite Markov order and its generating model is not synchronizable; the predictions show that the LSTM fails to uncover the 1-probability distribution.

[Extended Data Fig. 4 caption fragment (panels i-j):] We show the output probability vectors χ̃(r_i), i = 0, 1, 2, 3. For example, at state r_0, the probability of producing the output symbols a, b, and c is .2, .3, and .5, respectively. Note that an XPFSA does not have a transition probability map as a PFSA does. Panel j: the synchronous composition of the 2-state PFSA in Extended Data Fig. 4a with the XPFSA in Panel i. We show the composition over {q_0, q_1} × {r_0, r_1, r_2, r_3}, but highlight only the strongly connected component that is kept for the synchronous composition.
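Returning to the risk-map evaluation described above, a minimal sketch (the grid representation, σ, and the use of scipy's gaussian_filter are our assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def risk_map_positives(predictions, sigma, z):
    """Place unit mass at every tile with a positive prediction, smooth
    with a 2D Gaussian of standard deviation sigma, and report tiles whose
    risk exceeds the decision threshold z as positives."""
    risk = gaussian_filter(predictions.astype(float), sigma=sigma)
    return risk > z
```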
Extended Data Fig. 6. Examples in Supplementary Text Sec. VIII. For the non-synchronizable 3-state, 3-symbol PFSA G in Panel f, and for sequence lengths d = 3 and 9, we show in Panel a the scatter plots of {(φ̂_x(0), φ̂_x(1)) : |x| = d}, in Panel b the density plots of the previous set with the weight of the point (φ̂_x(0), φ̂_x(1)) being p_G(x), in Panel c the scatter plots of {(φ̂_{0⁵x}(0), φ̂_{0⁵x}(1)) : |x| = d}, and in Panel d the density plots of the previous set with the weight of the point (φ̂_{0⁵x}(0), φ̂_{0⁵x}(1)) being p_G(x|0⁵). Panel e is a PFSA with 5 states over a binary alphabet, and Panel g is the running tree of GenESeSS in the case that the x₀ found on line 4 of Extended Data Algorithm 2 is a synchronizing sequence to state q₀. A new state is colored orange, and a repeated state is colored gray.