Multi-Resolution Continuous Normalizing Flows

Recent work has shown that Neural Ordinary Differential Equations (ODEs) can serve as generative models of images using the perspective of Continuous Normalizing Flows (CNFs). Such models offer exact likelihood calculation, and invertible generation/density estimation. In this work we introduce a Multi-Resolution variant of such models (MRCNF), by characterizing the conditional distribution over the additional information required to generate a fine image that is consistent with the coarse image. We introduce a transformation between resolutions that allows for no change in the log likelihood. We show that this approach yields comparable likelihood values for various image datasets, with improved performance at higher resolutions, with fewer parameters, using only 1 GPU. Further, we examine the out-of-distribution properties of (Multi-Resolution) Continuous Normalizing Flows, and find that they are similar to those of other likelihood-based generative models.


Introduction
Reversible generative models derived through the use of the change of variables technique [16,42,25,83] are growing in interest as alternatives to generative models based on Generative Adversarial Networks (GANs) [20] and Variational Autoencoders (VAEs) [40]. While GANs and VAEs have been able to produce visually impressive samples of images, they have a number of limitations. A change of variables approach facilitates the transformation of a simple base probability distribution into a more complex model distribution. Reversible generative models using this technique are attractive because they enable efficient density estimation, efficient sampling, and computation of exact likelihoods.
A promising variation of the change-of-variable approach is based on the use of a continuous time variant of normalizing flows [9,21,18], which uses an integral over continuous time dynamics to transform a base distribution into the model distribution, called Continuous Normalizing Flows (CNF). This approach uses ordinary differential equations (ODEs) specified by a neural network, or Neural ODEs. CNFs have been shown to be capable of modelling complex distributions such as those associated with images.
While this new paradigm for the generative modelling of images is not as mature as GANs or VAEs in terms of the generated image quality, it is a promising direction of research as it does not have some key shortcomings associated with GANs and VAEs. Specifically, GANs are known to suffer from mode-collapse [49], and are notoriously difficult to train [2] compared to feed forward networks because their adversarial loss seeks a saddle point instead of a local minimum [4]. CNFs are trained by mapping images to noise, and their reversible architecture allows images to be generated by going in reverse, from noise to images. This leads to fewer issues related to mode collapse, since any input example in the dataset can be recovered from the flow using the reverse of the transformation learned during training. VAEs only provide a lower bound on the marginal likelihood whereas CNFs provide exact likelihoods. Despite the many advantages of reversible generative models built with CNFs, quantitatively such methods still do not match the widely used Fréchet Inception Distance (FID) scores of GANs or VAEs. However, their other advantages motivate us to explore them further.

Figure 1: The architecture of our Multi-Resolution Continuous Normalizing Flow (MRCNF) method (best viewed in color). Continuous normalizing flows (CNFs) $g_s$ are used to generate images $x_s$ from noise $z_s$ at each resolution, with those at finer resolutions conditioned (dashed lines) on the coarser image one level above $x_{s+1}$, except at the coarsest level where it is unconditional. Every finer CNF produces an intermediate image $y_s$, which is then combined with the immediate coarser image $x_{s+1}$ using a linear map M from eq. (8) to form $x_s$. The multiscale maps are defined by eq. (15).
Furthermore, state-of-the-art GANs and VAEs exploit the multi-resolution properties of images, and recent top-performing methods also inject noise at each resolution [5,72,38,77]. While shaping noise is fundamental to normalizing flows, only recently have normalizing flows exploited the multi-resolution properties of images. For example, WaveletFlow [83] splits an image into multiple resolutions using the Discrete Wavelet Transform, and models the average image at each resolution using a normalizing flow. While this method has advantages, it suffers from many issues such as high parameter count and long training time.
In this work, we consider a non-trivial multi-resolution approach to continuous normalizing flows, which fixes many of these issues. A high-level view of our approach is shown in Figure 1. Our main contributions are:

2 Background

Normalizing Flows
Normalizing flows [75,33,16,64,44] are generative models that map a complex data distribution, such as real images, to a known noise distribution. They are trained by maximizing the log likelihood of their input images. Suppose a normalizing flow g produces output z from an input x, i.e. z = g(x). The change-of-variables formula provides the likelihood of the image under this transformation as:

$$\log p(x) = \log \left| \det \frac{\mathrm{d}g}{\mathrm{d}x} \right| + \log p(z) \tag{1}$$

The first term on the right (log determinant of the Jacobian) is often intractable; however, previous works on normalizing flows have found ways to estimate this efficiently. The second term, log p(z), is computed as the log probability of z under a known noise distribution, typically the standard Gaussian N(0, I).
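As a small worked instance of the change-of-variables formula, the sketch below uses a one-dimensional affine map z = (x - mu) / sigma in place of a learned flow. The map, the parameter values, and the function names are illustrative assumptions and are not part of the method; the point is only that the log-determinant term plus log p(z) recovers the density of N(mu, sigma^2).

```python
import math

def standard_normal_logpdf(z):
    """Log density of the standard Gaussian N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_logpdf(x, mu, sigma):
    """log p(x) for the toy flow z = g(x) = (x - mu) / sigma via change of variables."""
    z = (x - mu) / sigma
    log_det = math.log(abs(1.0 / sigma))  # log |dg/dx| in one dimension
    return log_det + standard_normal_logpdf(z)

def gaussian_logpdf(x, mu, sigma):
    """Closed-form log density of N(mu, sigma^2), used only as a reference."""
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

flow_value = affine_flow_logpdf(1.3, mu=0.5, sigma=2.0)
reference = gaussian_logpdf(1.3, mu=0.5, sigma=2.0)
print(flow_value, reference)  # the two values agree up to rounding
```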

Wavelet Flow [83]
WaveletFlow splits an image using the Discrete Wavelet Transformation, and maps the average image at each resolution to noise using a normalizing flow. WaveletFlow builds on the Glow [42] architecture. It uses an orthogonal transformation, which does not preserve range, and adds a constant term to the log likelihood at each resolution. Best results are obtained when WaveletFlow models with a high parameter count are trained for a long period of time. We aim to fix these issues using our MRCNF.

Continuous Normalizing Flows
Continuous Normalizing Flows (CNF) [9,21,18] are a variant of normalizing flows that operate in the continuous domain. A CNF creates a geometric flow between the input and target (noise) distributions, by assuming that the state transition is governed by an Ordinary Differential Equation (ODE). It further assumes that the differential function is parameterized by a neural network; this model is called a Neural ODE [9]. Suppose CNF g transforms its state v(t) using a Neural ODE, with neural network f defining the differential. Here, $v(t_0) = x$ is, say, an image, and at the final time step $v(t_1) = z$ is a sample from a known noise distribution:

$$\frac{\mathrm{d}v(t)}{\mathrm{d}t} = f(v(t), t) \implies v(t_1) = v(t_0) + \int_{t_0}^{t_1} f(v(t), t)\, \mathrm{d}t \tag{2}$$

This integration is typically performed by an ODE solver. Since this integration can be run backwards as well to obtain the same $v(t_0)$ from $v(t_1)$, a CNF is a reversible model.
Equation 1 can be used to compute the change in log-probability induced by the CNF. However, Chen et al. [9] and Grathwohl et al. [21] proposed a more efficient variant in the context of CNFs, called the instantaneous change-of-variables formula:

$$\frac{\partial \log p(v(t))}{\partial t} = -\mathrm{Tr}\left(\frac{\partial f}{\partial v(t)}\right) \implies \Delta \log p_{v(t_0) \to v(t_1)} = \log p(v(t_0)) - \log p(v(t_1)) = \int_{t_0}^{t_1} \mathrm{Tr}\left(\frac{\partial f}{\partial v(t)}\right) \mathrm{d}t \tag{3}$$

Hence, the change in log-probability of the state of the Neural ODE, i.e. $\Delta \log p_v$, is expressed as another differential equation. The ODE solver now solves both differential equations eq. (2) and eq. (3) by augmenting the original state with the above. Thus, a CNF provides both the final state $v(t_1)$ as well as the change in log probability $\Delta \log p_{v(t_0) \to v(t_1)}$ together.
Prior works [21,18,19,62,32] have trained CNFs as reversible generative models of images, by maximizing the likelihood of the images under the model:

$$\log p(x) = \Delta \log p_{x \to z} + \log p(z) \tag{4}$$

where x is an image, z and $\Delta \log p_{x \to z}$ are computed by the CNF using eq. (2) and eq. (3), and log p(z) is the likelihood of the computed z under a known noise distribution, typically the standard Gaussian N(0, I). Novel images are generated by sampling z from the known noise distribution, and running it through the CNF in reverse.
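To make the joint integration of the state and the log-probability change concrete, the following sketch integrates a one-dimensional toy CNF with fixed-step Euler updates. The linear dynamics f(v, t) = a v stand in for the neural network, and the step count and function names are assumptions made for illustration; practical CNFs such as FFJORD use adaptive ODE solvers and a stochastic trace estimator instead.

```python
import math

def f(v, t, a=0.7):
    """Toy dynamics dv/dt = a * v; a placeholder for the neural network."""
    return a * v

def df_dv(v, t, a=0.7):
    """Trace of the Jacobian of f; in one dimension this is just df/dv."""
    return a

def cnf_forward(x, t0=0.0, t1=1.0, steps=1000):
    """Integrate the state and the accumulated Jacobian trace with Euler steps."""
    h = (t1 - t0) / steps
    v, trace_integral, t = x, 0.0, t0
    for _ in range(steps):
        trace_integral += h * df_dv(v, t)
        v += h * f(v, t)
        t += h
    return v, trace_integral

def log_likelihood(x):
    """log p(x) = log|det dz/dx| + log p(z), with the image at t0 and noise at t1."""
    z, trace_integral = cnf_forward(x)
    log_pz = -0.5 * (z * z + math.log(2.0 * math.pi))
    # For a CNF, the log-determinant of the map x -> z equals the time
    # integral of the Jacobian trace along the trajectory.
    return trace_integral + log_pz

print(log_likelihood(0.4))
```

For these linear dynamics the exact answer is available in closed form (z = x e^a and a log-determinant of a), which makes the sketch easy to check by hand.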

Our method
Our method is a reversible generative model of images that builds on top of CNFs. We introduce the notion of multiple resolutions in images, and connect the different resolutions in an autoregressive fashion. This helps generate images faster, with better likelihood values at higher resolutions, using only one GPU in all our experiments. We call this model Multi-Resolution Continuous Normalizing Flow (MRCNF).

Multi-Resolution image representation
Multi-resolution representations of images have been explored in computer vision for decades [7,56,80,6,54,50]. This implies that much of the content of an image at a resolution is a composition of low-level information captured at coarser resolutions, and some high-level information not present in the coarser images. We take advantage of this property by first decomposing an image in resolution space, i.e. by expressing it as a series of S images at decreasing resolutions: $x \to (x_1, x_2, \ldots, x_S)$, where $x_1 = x$ is the finest image, $x_S$ is the coarsest, and every $x_{s+1}$ is the average image of $x_s$. This is called an image pyramid, or a Gaussian Pyramid if the upsampling-downsampling operations include a Gaussian filter [7,6,1,80,50]. In this work, we obtain a coarser image simply by averaging pixels in every 2×2 patch, thereby halving the width and height.
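The averaging pyramid described here can be sketched in a few lines. The example below assumes a single-channel square image stored as a nested list with even side lengths; the function names are illustrative and not taken from the authors' code.

```python
def downsample(image):
    """Average every 2x2 patch of a 2D list, halving width and height."""
    h, w = len(image), len(image[0])
    return [
        [
            (image[i][j] + image[i][j + 1]
             + image[i + 1][j] + image[i + 1][j + 1]) / 4.0
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]

def image_pyramid(image, num_resolutions):
    """Return [x_1, ..., x_S], with x_1 the finest and x_S the coarsest image."""
    pyramid = [image]
    for _ in range(num_resolutions - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pyramid = image_pyramid(x, 3)
print(pyramid[1])  # [[2.5, 4.5], [10.5, 12.5]]
print(pyramid[2])  # [[7.5]]
```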
However, this representation is redundant since much of the information in $x_1$ is contained in $x_{s>1}$. Instead, we express x as a series of high-level information $y_s$ not present in the immediate coarser images $x_{s+1}$, and a final coarse image $x_S$:

$$x \to (y_1, y_2, \ldots, y_{S-1}, x_S) \tag{5}$$

Our overall method is to map these S terms to S noise samples using S CNFs.

3.2 Defining the high-level information $y_s$
We choose to design a linear transformation with the following properties: 1) invertible, i.e. it should be possible to deterministically obtain $x_s$ from $y_s$ and $x_{s+1}$, and vice versa; 2) volume preserving, i.e. determinant is 1, change in log-likelihood is 0; 3) angle preserving; and 4) range preserving (under the notion of the maximum principle [79]).
Consider the simplest case of 2 resolutions where $\mathbf{x}_1$ is a 2×2 image with pixel values $x_1, x_2, x_3, x_4$, and $\mathbf{x}_2$ is a 1×1 image with pixel value $\bar{x} = \frac{1}{4}(x_1 + x_2 + x_3 + x_4)$. We require three values $(y_1, y_2, y_3) = \mathbf{y}_1$ that contain information not present in $\mathbf{x}_2$, such that $\mathbf{x}_1$ is obtained when $\mathbf{y}_1$ and $\mathbf{x}_2$ are combined. This could be viewed as a problem of finding a matrix M such that:

$$[x_1, x_2, x_3, x_4]^\top = M\, [y_1, y_2, y_3, \bar{x}]^\top$$

We fix the last column of M as $[1, 1, 1, 1]^\top$, since every pixel value in $\mathbf{x}_1$ depends on $\bar{x}$. Finding the rest of the parameters can be viewed as requiring four 3D vectors that are spaced such that they do not degenerate the number of dimensions of their span. These can be considered as the four corners of a tetrahedron in 3D space, under any configuration (rotated in 3D space), and any scaling of the vectors (see Figure 2).
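Out of the many possibilities for this tetrahedron, we could choose the matrix that performs the Discrete Haar Wavelet Transform [54,55]:

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 2 \\ 1 & -1 & -1 & 2 \\ -1 & 1 & -1 & 2 \\ -1 & -1 & 1 & 2 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \bar{x} \end{bmatrix} \iff \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \bar{x} \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \tag{6}$$

However, this has $\log \det(M^{-1}) = \log(1/2)$ (eq. (6)), and is therefore not volume preserving.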
Other simple scalings of eq. (6) have been used in the past, for example multiplying the last row of eq. (6) by 2, yielding an orthogonal transformation, such as in WaveletFlow [83]. However, this transformation neither preserves the volume, i.e. the log determinant is not 0, nor the maximum, i.e. the range of $x_s$ changes.
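We wish to find a transformation M where: one of the results is the average of the inputs, $\bar{x}$; it is unit determinant; the columns are orthogonal; and it preserves the range of $\bar{x}$. Fortunately such a matrix exists, although we have not seen it discussed in prior literature. It can be seen as a variant of the Discrete Haar Wavelet Transformation matrix that is unimodular, i.e. has a determinant of 1 (and is therefore volume preserving), while also preserving the range of the images for the input and its average:

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \frac{1}{a} \begin{bmatrix} c & c & c & a \\ c & -c & -c & a \\ -c & c & -c & a \\ -c & -c & c & a \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \bar{x} \end{bmatrix} \iff \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \bar{x} \end{bmatrix} = \begin{bmatrix} c^{-1} & c^{-1} & -c^{-1} & -c^{-1} \\ c^{-1} & -c^{-1} & c^{-1} & -c^{-1} \\ c^{-1} & -c^{-1} & -c^{-1} & c^{-1} \\ a^{-1} & a^{-1} & a^{-1} & a^{-1} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \tag{7}$$

where $c = 2^{2/3}$, $a = 4$. Hence, $\log \det(M^{-1}) = \log(1) = 0$. This can be scaled up to larger spatial regions by performing the same calculation for each 2×2 patch. Let $\mathcal{M}$ be the function that uses matrix M from above and combines every pixel in $x_{s+1}$ with the three corresponding pixels in $y_s$ to make the 2×2 patch at that location in $x_s$ using eq. (7):

$$x_s = \mathcal{M}(y_s, x_{s+1}) \iff (y_s, x_{s+1}) = \mathcal{M}^{-1}(x_s) \tag{8}$$

Equation 1 can be used to compute the change in log likelihood from this transformation $x_s \to (y_s, x_{s+1})$:

$$\log p(x_s) = \Delta \log p_{x_s \to (y_s, x_{s+1})} + \log p(y_s, x_{s+1}), \qquad \Delta \log p_{x_s \to (y_s, x_{s+1})} = \log \left| \det(M^{-1}) \right| \tag{9}$$

where $\log \det(M^{-1}) = \mathrm{dims}(x_{s+1}) \log(1/2)$ in the case of eq. (6), where "dims" is the number of pixels times the number of channels (typically 3) in the image, and $\log \det(M^{-1}) = 0$ for eq. (7).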
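The stated properties can be checked numerically. The sketch below assumes one concrete sign pattern for the matrix (a Haar-like pattern scaled by c/a, with a last column of ones); the constants c = 2^{2/3} and a = 4 are from the text, while the sign layout, helper names, and pixel values are assumptions made for illustration.

```python
C = 2.0 ** (2.0 / 3.0)
A = 4.0

# Assumed sign pattern of the first three columns (Haar-like).
SIGNS = [
    [1, 1, 1],
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]
# M maps (y1, y2, y3, avg) to the four pixels of a 2x2 patch.
M = [[s * C / A for s in row] + [1.0] for row in SIGNS]
# M_INV maps the four pixels back to (y1, y2, y3, avg).
M_INV = [[SIGNS[i][k] / C for i in range(4)] for k in range(3)] + [[1.0 / A] * 4]

def det(m):
    """Determinant by Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, value in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * value * det(minor)
    return total

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

pixels = [0.1, 0.9, 0.4, 0.6]
y1, y2, y3, avg = matvec(M_INV, pixels)
restored = matvec(M, [y1, y2, y3, avg])
print(det(M))    # close to 1: the map is volume preserving
print(avg)       # close to 0.5: the mean of the four pixels
print(restored)  # close to the original pixels
```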

Multi-Resolution Continuous Normalizing Flows
Using the multi-resolution image representation in eq. (5), we characterize the conditional distribution over the additional degrees of freedom ($y_s$) required to generate a higher resolution image ($x_s$) that is consistent with the average ($x_{s+1}$) over the equivalent pixel space. At each resolution s, we use a CNF to reversibly map between $y_s$ (or $x_S$ when s=S) and a sample $z_s$ from a known noise distribution. For generation, $y_s$ only adds 3 degrees of freedom per pixel of $x_{s+1}$, which contain information missing in $x_{s+1}$, but conditional on it.
This framework ensures that one coarse image could generate several potential fine images, but these fine images have the same coarse image as their average. This fact is preserved across resolutions. Note that the 3 additional pixels in $y_s$ per pixel in $x_{s+1}$ are generated conditioned on the entire coarser image $x_{s+1}$, thus maintaining consistency using the full context.
In principle, any generative model could be used to map between the multi-resolution image and noise. Normalizing flows are good candidates for this as they are probabilistic generative models that perform exact likelihood estimates, and can be run in reverse to generate novel data from the model's distribution. This allows model comparison and measurement of generalization to unseen data. We choose to use the CNF variant of normalizing flows at each resolution. CNFs have recently been shown to be effective in modeling image distributions using a fraction of the number of parameters typically used in normalizing flows (and non flow-based approaches), and their underlying framework of Neural ODEs has been shown to be more robust than convolutional layers [82].
Training: We train an MRCNF by maximizing the average log-likelihood of the images in the training dataset under the model. The log probability of each image log p(x) can be estimated recursively from eq. (9) as:

$$\log p(x) = \sum_{s=1}^{S-1} \left( \Delta \log p_{x_s \to (y_s, x_{s+1})} + \log p(y_s \mid x_{s+1}) \right) + \log p(x_S)$$

where $\Delta \log p_{x_s \to (y_s, x_{s+1})}$ is given by eq. (9), and $\log p(y_s \mid x_{s+1})$ and $\log p(x_S)$ are given by eq. (4):

$$\log p(y_s \mid x_{s+1}) = \Delta \log p_{(y_s \to z_s) \mid x_{s+1}} + \log p(z_s), \qquad \log p(x_S) = \Delta \log p_{x_S \to z_S} + \log p(z_S)$$

The coarsest resolution S can be chosen such that the last CNF operates on the image distribution at a small enough resolution that is easy to model unconditionally. All other CNFs are conditioned on the immediate coarser image. The conditioning itself is achieved by concatenating the input image of the CNF with the coarser image. This model could be seen as a stack of CNFs connected in an autoregressive fashion.
Typically, likelihood-based generative models are compared using the metric of bits-per-dimension (BPD), i.e. the negative log likelihood per pixel in the image. Hence, we train our MRCNF to minimize the average BPD of the images in the training dataset, computed using eq. (13):

$$\mathrm{bpd}(x) = \frac{-\log_2 p(x)}{\mathrm{dims}(x)}$$

We use FFJORD [21] as the baseline model for our CNFs. In addition, we use two regularization terms introduced by RNODE [18] to speed up the training of FFJORD models by stabilizing the learnt dynamics: the kinetic energy of the flow K(θ), and the Jacobian norm B(θ):

$$K(\theta) = \int_{t_0}^{t_1} \left\| f(v(t), t, \theta) \right\|_2^2 \, \mathrm{d}t, \qquad B(\theta) = \int_{t_0}^{t_1} \left\| \epsilon^\top \nabla_v f(v(t), t, \theta) \right\|_2^2 \, \mathrm{d}t, \quad \epsilon \sim \mathcal{N}(0, I)$$

Parallel training: Note that although the final log likelihood log p(x) involves sequentially summing over values returned by all S CNFs, the log likelihood term of each CNF is independent of the others. Conditioning is done using ground truth images. Hence, each CNF can be trained independently, in parallel.
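As a brief illustration of the bits-per-dimension metric, the sketch below converts a log-likelihood in nats into BPD. It assumes log p(x) is computed with natural logarithms, so the conversion divides by log 2; the function name and the example input (a standard Gaussian evaluated at the origin for a 32×32×3 image) are illustrative assumptions.

```python
import math

def bits_per_dim(log_likelihood_nats, num_dims):
    """Negative log-likelihood per dimension, converted from nats to bits."""
    return -log_likelihood_nats / (num_dims * math.log(2.0))

# Toy example: log density of N(0, I) at the origin in D = 32 * 32 * 3 dimensions.
D = 32 * 32 * 3
log_p = D * (-0.5 * math.log(2.0 * math.pi))
print(bits_per_dim(log_p, D))  # equals 0.5 * log2(2 * pi), about 1.33
```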
Generation: Given an S-resolution model, we first sample $z_s$, s = 1, . . ., S from the latent noise distributions. The CNF $g_s$ at resolution s transforms the noise sample $z_s$ to high-level information $y_s$ conditioned on the immediate coarse image $x_{s+1}$ (except $g_S$ which is unconditioned). $y_s$ and $x_{s+1}$ are then combined to form $x_s$ using M from eq. (7). This process is repeated progressively from coarser to finer resolutions, until the finest resolution image $x_1$ is computed (see Figure 1). Note that the generated image at one resolution is used to condition the CNF at the finer resolution.
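The coarse-to-fine generation loop can be sketched as follows. The two "flow" functions are hypothetical placeholders for trained CNFs run in reverse (they only rescale noise), the image is single-channel, and the sign pattern inside `combine` is an assumed Haar-like layout scaled by c/4 with c = 2^{2/3}; only the overall control flow mirrors the description above.

```python
import random

C = 2.0 ** (2.0 / 3.0)  # constant c of the assumed unimodular matrix
SIGNS = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]

def combine(y, coarse):
    """Map (y_s, x_{s+1}) to x_s: one coarse pixel plus three y values per 2x2 patch."""
    n = len(coarse)
    fine = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            patch = [
                coarse[i][j] + (C / 4.0) * sum(s * v for s, v in zip(signs, y[i][j]))
                for signs in SIGNS
            ]
            fine[2 * i][2 * j], fine[2 * i][2 * j + 1] = patch[0], patch[1]
            fine[2 * i + 1][2 * j], fine[2 * i + 1][2 * j + 1] = patch[2], patch[3]
    return fine

def toy_coarse_flow(z):
    """Hypothetical stand-in for the unconditional CNF g_S run in reverse."""
    return [[0.5 + 0.1 * v for v in row] for row in z]

def toy_fine_flow(z, coarse):
    """Hypothetical stand-in for a conditional CNF g_s run in reverse."""
    n = len(coarse)
    return [
        [[0.05 * v * (1.0 + coarse[i][j]) for v in z[i][j]] for j in range(n)]
        for i in range(n)
    ]

def generate(num_resolutions, coarsest_size, rng):
    """Sample coarse-to-fine: x_S first, then y_s and x_s for s = S-1, ..., 1."""
    z = [[rng.gauss(0.0, 1.0) for _ in range(coarsest_size)] for _ in range(coarsest_size)]
    x = toy_coarse_flow(z)
    for _ in range(num_resolutions - 1):
        n = len(x)
        z = [[[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)] for _ in range(n)]
        y = toy_fine_flow(z, x)  # conditioned on the image generated so far
        x = combine(y, x)
    return x

image = generate(num_resolutions=3, coarsest_size=2, rng=random.Random(0))
print(len(image), len(image[0]))  # 8 8
```

Because the three sign columns each sum to zero, every generated 2×2 patch averages back to the coarse pixel it was built from, which is the consistency property described in the text.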

Multi-Resolution Noise
We also decompose the noise image into its respective coarser components. This means that ultimately we use only one noise image at the finest level, but it is decomposed into multiple resolutions using eq. (7). $x_{s+1}$ is mapped to noise of a quarter variance, while $y_s$ is mapped to noise of c-factored variance (see fig. 1). Although this is optional, it preserves interpretation between the single- and multi-resolution models.
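The quoted variances follow from applying the inverse transformation to i.i.d. unit-variance noise: each output's variance is the sum of squared weights in its row. The sketch below checks this for an assumed form of the inverse matrix (three Haar-like rows scaled by 1/c, and an averaging row); the sign layout and names are assumptions made for illustration.

```python
C = 2.0 ** (2.0 / 3.0)

# Assumed inverse matrix: rows 1-3 produce y, the last row produces the average.
M_INV = [
    [1 / C, 1 / C, -1 / C, -1 / C],
    [1 / C, -1 / C, 1 / C, -1 / C],
    [1 / C, -1 / C, -1 / C, 1 / C],
    [0.25, 0.25, 0.25, 0.25],
]

def output_variances(matrix, input_variance=1.0):
    """Variance of each output of a linear map applied to i.i.d. inputs."""
    return [input_variance * sum(w * w for w in row) for row in matrix]

variances = output_variances(M_INV)
print(variances)  # three entries close to c = 2^(2/3), then 0.25
```

The y components come out with variance 4 / c^2, which equals c because c^3 = 4, and the averaged component has variance 1/4.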

Related work
Multi-resolution approaches already serve as a key component of state-of-the-art GAN [15,37,36] and VAE [68,77] based deep generative models. Deconvolutional CNNs [51,67] use upsampling layers to generate images more effectively. Modern state-of-the-art generative models have also injected noise at different levels to improve sample quality [5,38,77].
Although these models achieve great results in terms of BPD and image quality, they report results with a significantly higher number of parameters (some with 100x as many), and several times the GPU hours of training.
STEER [19] introduced temporal regularization to CNFs by making the final time of integration stochastic. However, we found that this increased training time without significant BPD improvement.
Comparison to WaveletFlow: We emphasize that there are important differences between our MRCNF and WaveletFlow. We generalize the notion of a multi-resolution image representation (section 3.2), and show that Wavelets are one case of this general formulation. WaveletFlow builds on the Glow [42] architecture, while ours builds on CNFs [21,18]. We also make use of the notion of multi-resolution decomposition of the noise, which is optional, but is not taken into account by WaveletFlow. WaveletFlow uses an orthogonal transformation which does not preserve range; our MRCNF uses eq. (7) which is volume-preserving and range-preserving. Finally, WaveletFlow applies special sampling techniques to obtain better samples from its model. We have so far not used such techniques for generation, but we believe they can potentially help our models as well. By making these changes, we fix many of the previously discussed issues with WaveletFlow. For a more detailed ablation study, please check subsection 5.1.
"Multiple scales" in prior normalizing flows: Normalizing flows [16,42,21] try to be "multi-scale" by transforming the input in a smart way (squeezing operation) such that the width of the features progressively reduces in the direction of image to noise, while maintaining the total dimensions. This happens while operating at a single resolution. In contrast, our model stacks normalizing flows at multiple resolutions in an autoregressive fashion by conditioning on the images at coarser resolutions.

Experimental results
Table 1: Bits-per-dimension (lower is better) of images in the corresponding evaluation sets for CIFAR10, ImageNet 32×32, and ImageNet 64×64. We also report the number of parameters in the models, and the time taken to train (in GPU hours). All our models were trained on only one GPU.

We train MRCNF models on the CIFAR10 [45] dataset at finest resolution of 32x32, and the ImageNet [14] dataset at 32x32, 64x64, 128x128. We build on top of the code provided in Finlay et al. [18]. In all cases, we train using only one NVIDIA RTX 2080 Ti GPU with 11GB.
In Table 5, we compare our results with prior work in terms of (lower is better in all cases) the BPD of the images of the test datasets under the trained models, the number of parameters used by the model, and the number of GPU hours taken to train. The most relevant models for comparison are the 1-resolution FFJORD [21] models, and their regularized version RNODE [18], since our model directly converts their architecture into multi-resolution. Other relevant comparisons are previous flow-based methods [16,42,74,25,83]; however, their core architecture (RealNVP [16]) is quite different from FFJORD.
BPD: At lower resolution spaces, we achieve comparable BPDs in less time with far fewer parameters than previous normalizing flows (and non flow-based approaches). However, the power of the multi-resolution formulation is more evident at higher resolutions: we achieve better BPD for ImageNet64 with significantly fewer parameters and lower time using only one GPU.
Note that we were not able to reproduce the same BPD as provided by STEER [19]; we report the results of our re-implementation. A more complete table can be found in the appendix.
Train time: All our experiments used only one GPU, and took significantly less time to train than 1-resolution CNFs, and all prior works including flow-based and non-flow-based models. Since all the CNFs can be trained in parallel, the actual training time in practice could be much lower than reported. For ImageNet128, our MRCNF achieves a BPD of 3.31 (Table 2) with just 2.74M parameters in ≈60 GPU hours.

Ablation study
Our MRCNF method differs from WaveletFlow in three respects: 1. we use CNFs, 2. we use eq. (7) instead of eq. (6) as used by WaveletFlow, 3. we use multi-resolution noise. We check the individual effects of these changes in an ablation study in Table 3, and conclude that:
1. Simply replacing the normalizing flows in WaveletFlow with CNFs does not produce the best results. It does improve the BPD and training time compared to WaveletFlow.
2. Using our unimodular transformation in eq. (7) instead of the original Wavelet Transformation of eq. (6) not only improves the BPD, it also consistently decreases training time.
3. As expected, the use of multi-resolution noise does not have a critical impact on either BPD or training time. We use it anyway so as to retain interpretation with 1-resolution models.

Table 3: Ablation study across using Wavelet in eq. (6), and multi-resolution noise formulation in 3.4. Columns: BPD, PARAM, TIME for CIFAR10, followed by BPD, PARAM, TIME for ImageNet64. Recoverable entries: WaveletFlow [83]: 3.78, 98.0M, 822.00; 1-resolution CNF (RNODE) [18]: 3.38 (remaining entries not recoverable).
Thus, our MRCNF model is not a trivial replacement of normalizing flows with CNFs in WaveletFlow.
We generalize the notion of multi-resolution image representation, in which the Discrete Wavelet Transform is one of many possibilities.We then derived a unimodular transformation that adds no change in likelihood.
6 Examining Out-of-Distribution behaviour
The derivation of likelihood-based models suggests that the density of an image under the model is an effective measure of its likelihood of being in distribution. However, recent works [76,58,71,59] have pointed out that it is possible that images drawn from other distributions have higher model likelihood. Examples have been shown where normalizing flow models (such as Glow) trained on CIFAR10 images assign higher likelihood to SVHN [60] images. This could have serious implications on the practical applicability of these models. Some also note that likelihood-based models do not generate images with good sample quality as they avoid assigning small probability to out-of-distribution (OoD) data points, hence using model likelihood (-BPD) for detecting OoD data is not effective.
We conduct the same experiments with (MR)CNFs, and find that similar conclusions can be drawn. Figure 4 plots the histogram of log likelihood per dimension (-BPD) of OoD images (SVHN, TinyImageNet) under MRCNF models trained on CIFAR10. It can be observed that the likelihood of the OoD SVHN is higher than CIFAR10 for MRCNF, similar to the findings for Glow, PixelCNN, VAE in earlier works [58,13,71,59,43].
One possible explanation put forward by Nalisnick et al. [59] is that "typical" images are less "likely" than constant images, which is a consequence of the distribution of a Gaussian in high dimensions. Indeed, as our Figure 4 shows, constant images have the highest likelihood under MRCNFs, while randomly generated (uniformly distributed) pixels have the least likelihood (not shown in figure due to space constraints).
Choi et al. [13], Nalisnick et al. [59] suggest using "typicality" as a better measure of OoD. However, Serrà et al. [71] observe that the complexity of an image plays a significant role in the training of likelihood-based generative models. They propose a new metric S as an out-of-distribution detector that combines the model likelihood with a complexity term L(x), where L(x) is the complexity of an image x measured as the length of the best compressed version of x (we use FLIF [73] following Serrà et al. [71]) normalized by the number of dimensions. We perform a similar analysis as Serrà et al. [71] to test how S compares with -bpd for OoD detection. For different MRCNF models trained on CIFAR10, we compute the area under the receiver operating characteristic curve (auROC) using -bpd and S as standard evaluation for the OoD detection task [24,71].
Table 4 shows that S does perform better than -bpd in the case of (MR)CNFs, similar to the findings in Serrà et al. [71] for Glow and PixelCNN++. It seems that SVHN is easier to detect as OoD for Glow than MRCNFs. However, OoD detection performance is about the same for TinyImageNet. We also observe that MRCNFs are better at OoD detection than CNFs.
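For readers unfamiliar with the auROC evaluation, the sketch below computes it from two lists of scores using the pairwise-ranking definition, under the convention that a higher score should indicate in-distribution data (as when scoring with -bpd). The estimator, the function name, and the scores are illustrative assumptions; the scores are made up and are not results from this work.

```python
def auroc(in_dist_scores, ood_scores):
    """Area under the ROC curve for scores where higher means in-distribution.

    Computed as the fraction of (in-distribution, OoD) pairs that are ranked
    correctly, counting ties as half.
    """
    correct = 0.0
    for s_in in in_dist_scores:
        for s_ood in ood_scores:
            if s_in > s_ood:
                correct += 1.0
            elif s_in == s_ood:
                correct += 0.5
    return correct / (len(in_dist_scores) * len(ood_scores))

# Made-up scores for illustration only.
in_dist = [-3.4, -3.5, -3.3, -3.6]
ood = [-2.4, -3.45, -2.6, -2.5]
print(auroc(in_dist, ood))  # 0.125
```

A value well below 0.5, as in this toy case, corresponds to the situation described above in which OoD images receive higher likelihood than in-distribution images.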

Shuffled in-distribution images
Kirichenko et al. [43] conclude that normalizing flows do not represent images based on their semantic contents, but rather directly encode their visual appearance. We verify this for continuous normalizing flows by estimating the density of in-distribution test images, but with patches of pixels randomly shuffled (see Figure 5).
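The patch-shuffling operation can be sketched as follows. This is a minimal illustration that assumes a square single-channel image whose side is divisible by the patch size; the function name and the use of a seeded `random.Random` are assumptions, not the authors' implementation.

```python
import random

def shuffle_patches(image, patch, rng):
    """Split a square 2D list into patch x patch tiles and permute the tiles."""
    n = len(image)
    tiles_per_side = n // patch
    positions = [(r, c) for r in range(tiles_per_side) for c in range(tiles_per_side)]
    sources = positions[:]
    rng.shuffle(sources)  # random assignment of source tiles to destinations
    out = [[None] * n for _ in range(n)]
    for (dr, dc), (sr, sc) in zip(positions, sources):
        for i in range(patch):
            for j in range(patch):
                out[dr * patch + i][dc * patch + j] = image[sr * patch + i][sc * patch + j]
    return out

image = [[4 * r + c for c in range(4)] for r in range(4)]
shuffled = shuffle_patches(image, 2, random.Random(1))
for row in shuffled:
    print(row)
```

With a patch size equal to the image size there is a single tile, so the image is returned unchanged; smaller patches destroy progressively more of the global structure while keeping the same pixel values.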

Conclusion
We have presented a Multi-Resolution approach to Continuous Normalizing Flows (MRCNF). MRCNF models achieve comparable or better performance in significantly less training time, training on a single GPU, with a fraction of the number of parameters of other competitive models. Although the likelihood values for 32×32 resolution datasets such as CIFAR10 and ImageNet32 do not improve over the baseline, ImageNet64 and above see a marked improvement. The performance is better for higher resolutions, as seen in the case of ImageNet128. We also conducted an ablation study to note the effects of each change we introduced in the formulation.
In addition, we show that (Multi-Resolution) Continuous Normalizing Flows have similar out-of-distribution properties as other Normalizing Flows.
In terms of broader social impacts of this work, generative models of images can be used to generate so-called fake images, and this issue has been discussed at length in other works.We emphasize lower computational budgets, and show comparable performance with far fewer parameters and less training time.
A Full Table 1

C Simple example of density estimation
For example, if we use the Euler method as our ODE solver, for density estimation Equation 2 reduces to:

$$v(t_1) = v(t_0) + (t_1 - t_0)\, f_s(v(t_0), t_0)$$

where $f_s$ is a neural network, $t_0$ represents the "time" at which the state is image x, and $t_1$ is when the state is noise z. We start at scale S with an image sample $x_S$, and assume $t_0$ and $t_1$ are 0 and 1 respectively: . . .

D Simple example of generation
For example, if we use the Euler method as our ODE solver, for generation Equation 2 reduces to:

$$v(t_0) = v(t_1) - (t_1 - t_0)\, f_s(v(t_1), t_1)$$

i.e. the state is integrated backwards from $t_1$ (i.e. $z_s$) to $t_0$ (i.e. $x_s$). We start at scale 0 with a noise sample $z_0$, and assume $t_0$ and $t_1$ are 0 and 1 respectively:

$$x_0 = z_0 - f_0(z_0, t_1)$$
$$x_1 = z_1 - f_1(z_1, t_1 \mid x_0)$$
. . .
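The two generation steps written above can be traced numerically with scalars in place of images. In the sketch below, `toy_dynamics` is a hypothetical stand-in for the networks f_0 and f_1 (a simple linear function, not the architecture of Appendix E), and the noise values 1.0 and 2.0 are arbitrary.

```python
def toy_dynamics(v, t, condition=0.0):
    """Hypothetical stand-in for the network f_s; linear in state and condition."""
    return 0.3 * v + 0.1 * condition

def euler_generate(z, condition=0.0, t0=0.0, t1=1.0):
    """One explicit Euler step taken backwards in time, from t1 (noise) to t0 (image)."""
    return z - (t1 - t0) * toy_dynamics(z, t1, condition)

x0 = euler_generate(1.0)                # x_0 = z_0 - f_0(z_0, t_1), unconditional
x1 = euler_generate(2.0, condition=x0)  # x_1 = z_1 - f_1(z_1, t_1 | x_0)
print(x0, x1)  # approximately 0.7 and 1.33
```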

E Models
We used the same neural network architecture as in RNODE [18]. The CNF at each resolution consists of a stack of bl blocks of a 4-layer deep convolutional network comprised of 3x3 kernels and softplus activation functions, with 64 hidden dimensions, and time t concatenated to the spatial input.
In addition, except at the coarsest resolution, the immediate coarser image is also concatenated with the state. The integration time of each piece is [0, 1]. The number of blocks bl and the corresponding total number of parameters are given in Table 6.

F Gradient norm
In order to avoid exploding gradients, we clipped the norm of the gradients [66] by a maximum value of 100.0. When an adversarial loss is used, we first clip the gradients provided by the adversarial loss by 50.0, sum up the gradients provided by the log-likelihood loss, and then clip the summed gradients by 100.0.
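The two-stage clipping procedure can be sketched on flat lists of gradient values. This is a minimal pure-Python illustration that assumes clipping by the global L2 norm; the function names are hypothetical, and a real training loop would use the clipping utility of its deep learning framework instead.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so that its L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]

def combined_gradients(adversarial_grads, likelihood_grads):
    """Clip adversarial gradients at 50, add likelihood gradients, clip the sum at 100."""
    adv = clip_by_global_norm(adversarial_grads, 50.0)
    summed = [a + b for a, b in zip(adv, likelihood_grads)]
    return clip_by_global_norm(summed, 100.0)

print(clip_by_global_norm([300.0, 400.0], 100.0))     # approximately [60.0, 80.0]
print(combined_gradients([300.0, 400.0], [0.0, 0.0]))  # approximately [30.0, 40.0]
```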

Figure 4: Histogram of log likelihood per dimension of out-of-distribution datasets (TinyImageNet, SVHN, Constant) under (MR)CNF models trained on CIFAR10. As with other likelihood-based generative models such as Glow and PixelCNN, OoD datasets have higher likelihood under (MR)CNFs.
Figure 5 (a) shows an example of images of shuffled patches of varying size; Figure 5 (b) shows the graph of their log-likelihoods. That shuffling pixel patches would render the image semantically meaningless is reflected in the Fréchet Inception Distance (FID) between CIFAR10-Train and these sets of shuffled images: 1x1: 340.42, 2x2: 299.99, 4x4: 235.22, 8x8: 101.36, 16x16: 33.06, 32x32 (i.e. CIFAR10-Test): 3.15. However, we see that images with large pixel patches shuffled are quite close in likelihood to the unshuffled images, suggesting that since their visual content has not changed much they are almost as likely as unshuffled images under MRCNFs.

G 8-bit to uniform
The change-of-variables formula gives the change in probability due to the transformation of u to v:

$$\log p(u) = \log p(v) + \log \left| \det \frac{\mathrm{d}v}{\mathrm{d}u} \right|$$

Specifically, the change of variables from an 8-bit image $a_S$ to an image $b_S$ with pixel values in range [0, 1] is:

$$\log p(a_S) = \log p(b_S) + \log \left| \det \frac{\mathrm{d}b}{\mathrm{d}a} \right| \implies \log p(a_S) = \log p(b_S) + \log \left( \frac{1}{256} \right)^{D_S}$$

Table 1 notes: Blank spaces indicate unreported values. ‡ As reported in [19]. § Re-implemented by us. 'x': Fails to train. * RNODE [18] used 4 GPUs to train on ImageNet64.

Table 2: Metrics for unconditional ImageNet128 generation.
-resolution MRCNF: 3.31 ±0.69 (BPD), 2.74M (parameters), 58.59 (GPU hours)

Table 5: Unconditional image generation metrics (lower is better in all cases): number of parameters in the model, bits-per-dimension, time (in hours). Most previous models use multiple GPUs for training; all our models were trained on only one NVIDIA V100 GPU. ‡ As reported in [19]. * FFJORD RNODE [18] used 4 GPUs to train on ImageNet64. 'x': Fails to train.

Table 6: Number of parameters for different models with different total number of resolutions (res), and the number of channels (ch) and number of blocks (bl) per resolution.