Matrix Completion of World Trade

This work applies Matrix Completion (MC) -- a class of machine-learning methods commonly used in the context of recommendation systems -- to analyse economic complexity. MC is applied to reconstruct the Revealed Comparative Advantage (RCA) matrix, whose elements express the relative advantage of countries in given classes of products, as evidenced by yearly trade flows. A high-accuracy binary classifier is derived from the application of MC, with the aim of discriminating between elements of the RCA matrix that are, respectively, higher or lower than one. We introduce a novel Matrix cOmpletion iNdex of Economic complexitY (MONEY) based on MC, which is related to the predictability of countries' RCA (the lower the predictability, the higher the complexity). Differently from previously-developed indices of economic complexity, the MONEY index takes into account the various singular vectors of the matrix reconstructed by MC, whereas other indices are based only on one/two eigenvectors of a suitable symmetric matrix, derived from the RCA matrix. Finally, MC is compared with a state-of-the-art economic complexity index (GENEPY). We show that the false positive rate per country of a binary classifier constructed starting from the average entry-wise output of MC can be used as a proxy of GENEPY.

is collected, for each year, in a matrix $\mathrm{RCA} \in \mathbb{R}^{C \times P}$, where $C$ is the number of countries considered and $P$ is the number of products examined (at a given aggregation level). In formulas, one has

$$\mathrm{RCA}_{c,p} := \frac{D_{c,p} \Big/ \sum_{p'=1}^{P} D_{c,p'}}{\sum_{c'=1}^{C} D_{c',p} \Big/ \sum_{c'=1}^{C} \sum_{p'=1}^{P} D_{c',p'}} , \qquad (1)$$

where $D_{c,p}$ is the return in international dollars of the exports of the product $p$ by country $c$. In the paper, MC is applied several times (starting from different training subsets of suitably discretized RCA values associated with several countries and products, excluding originally NaN values) to estimate the expected RCA values of pairs of countries $c$ and products $p$ that have not been used in the training phase. To fulfill this task, the adopted MC technique is based on a soft-thresholded SVD, which selects each time, via a suitable regularization technique, the subset of most informative singular values and corresponding singular vectors. The predictions provided by MC are then exploited to construct two surrogate incidence matrices: one is used to compute a novel index of economic complexity, and the other is used as an input to the GENEPY algorithm (Sciarra et al., 2020) [15].
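As an illustration of how the RCA matrix and its binarized counterpart are obtained from export data, the following minimal sketch assumes the standard Balassa form of the RCA ratio (the share of a product in a country's exports, divided by the share of that product in world exports). The function names `rca_matrix` and `incidence` are hypothetical, and returning NaN for empty rows or columns is an assumption made here for illustration.

```python
def rca_matrix(D):
    """RCA from an export-value matrix D (rows: countries, columns: products).

    Assumes the standard Balassa ratio. Entries whose country or product
    total is zero are returned as NaN (an illustrative convention).
    """
    n_c, n_p = len(D), len(D[0])
    row_tot = [sum(row) for row in D]                       # exports per country
    col_tot = [sum(D[c][p] for c in range(n_c)) for p in range(n_p)]  # per product
    grand = sum(row_tot)                                    # world exports
    rca = []
    for c in range(n_c):
        out = []
        for p in range(n_p):
            if row_tot[c] == 0 or col_tot[p] == 0:
                out.append(float("nan"))
            else:
                out.append((D[c][p] / row_tot[c]) / (col_tot[p] / grand))
        rca.append(out)
    return rca


def incidence(rca):
    """Binarize RCA at the threshold 1 (NaN entries map to 0 here)."""
    return [[1 if v >= 1 else 0 for v in row] for row in rca]


# Toy example: two countries, each specialized in one product.
D = [[8.0, 2.0],
     [2.0, 8.0]]
R = rca_matrix(D)
M = incidence(R)
```

In the toy example, each country has RCA above 1 only in the product that dominates its export basket, so the incidence matrix is the identity.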
The work contributes to the literature on economic complexity in three ways: (i) it applies for the first time MC to assess the complexity of countries; (ii) it defines a novel index of economic complexity based on MC; (iii) it builds up a comparison with a state-of-the-art index of economic complexity (GENEPY), revealing a high correlation between the output of GENEPY when it is applied to the original incidence matrix and the false positive rate of a binary classifier derived by the repeated application of MC. The results of our analysis show that MC performs well in estimating the RCA of countries. Supported by the high quality predictions of MC, we propose a novel Matrix cOmpletion iNdex of Economic complexitY (MONEY) for countries, which exploits the accuracy of their RCA predictions derived from the repeated applications of MC. Such accuracy is expressed in terms of a suitably weighted Area Under the Curve (AUC), one for each country examined. The MONEY index ranks countries according to their predictability, taking into account also the complexity of the products. Specifically, the larger the AUC for a specific country and the larger the average with respect to a subset of the products of that country of the MC performance in estimating the discretized RCA values of country-product pairs, the less complex that country. Using MC to construct the proposed index helps to solve the shortcoming of GENEPY, i.e., the fact that, differently from MC, GENEPY takes into account only the information coming from two eigenvectors. Moreover, the GENEPY index computed using the MC surrogate incidence matrix reveals interesting discrepancies in terms of economic complexity with respect to the original GENEPY, i.e., the one calculated starting from the incidence matrix associated with the observed RCA matrix. 
By considering multiple years of data (see the Supplemental), we find a strong and significant positive correlation between the false positive rate of the binary classifier derived from thresholding the average output of MC and the original GENEPY index.

Predicting the revealed comparative advantage of countries: a matrix completion approach
In this work, we apply Matrix Completion (MC) techniques to study economic complexity. This class of machine-learning methods was popularized by the so-called Netflix competition (see the Appendix for further details on MC, and Hastie et al. [3] for some of its applications). This paper uses MC to estimate the expected revealed comparative advantage (RCA) of countries $c$ in products $p$. The specific MC method adopted in the paper consists in completing a partially observed matrix $A \in \mathbb{R}^{C \times P}$ (which is derived from the RCA matrix in our case), by minimizing a suitable trade-off between the reconstruction error on the known portion of that matrix and a penalty term, which penalizes a high nuclear norm of the reconstructed (or completed) matrix. This is formulated via the following optimization problem [13]:

$$\underset{Z \in \mathbb{R}^{C \times P}}{\mathrm{minimize}} \;\; \frac{1}{2} \sum_{(c,p) \in \Omega^{tr}} \left(A_{c,p} - Z_{c,p}\right)^2 + \lambda \|Z\|_* , \qquad (2)$$

where $\Omega^{tr}$ is a training subset of pairs of indices $(c, p)$ corresponding to positions of known entries of the partially observed matrix $A \in \mathbb{R}^{C \times P}$, $Z \in \mathbb{R}^{C \times P}$ is the completed matrix (to be optimized), $\lambda \geq 0$ is a regularization constant (chosen by a suitable validation method), and $\|Z\|_*$ is the nuclear norm of the matrix $Z$, i.e., the sum of all its singular values. The reader is referred to the Appendix for further technical details on the optimization problem (2) and on the algorithm we used to solve it. While MC has already found applications in many fields (e.g., movie recommendation, sensor engineering, econometrics), to the best of our knowledge, this is the first time it is used to analyze economic complexity. More precisely, we applied MC to define a novel complexity index to be compared with state-of-the-art complexity indices.
In our application of MC to economic complexity, the MC optimization problem (2) was solved several times by a specific algorithm previously developed for that purpose (named Soft Impute [13], see the Appendix), for different choices of the regularization parameter $\lambda$ and of the subset $\Omega^{tr}$ (detailed later in this section). The resulting MC predictions were used to construct surrogate incidence matrices in $\mathbb{R}^{C \times P}$, one of which was used to compute a GENEPY index to be compared with the GENEPY index computed using the original incidence matrix $M$ (footnote 2). In the following, we describe our approach of applying MC to the reconstruction of the RCA matrix for the case in which the products were aggregated at the 4-digit level in the Harmonized System Codes 1992 (HS-1992). Consistently with the literature [14], we constructed the matrix $A$ (one of the inputs to the optimization problem (2)) by discretizing the elements of the RCA matrix. For the sake of brevity, we refer to the MC application to the definition of a measure of complexity of the countries. To get a measure of complexity of the products, it is enough to replace the matrix $A$ with its transpose (see also the Supplemental for some related results).

1. For the matrix $A \in \mathbb{R}^{C \times P}$ (where $C = 119$ is the number of countries, and $P = 1243$ is the number of products), the MC optimization problem (2) was solved $N = 1000$ times by the Soft Impute algorithm, based on various choices for the training/validation/test sets (and, as already mentioned, for the regularization parameter $\lambda$).

2. For each such repetition $n = 1, \ldots, N$, the sets above were constructed as follows. First, a (pseudo)random permutation of the rows of $A$ was generated. Then, a subset $S_n$ of these rows was considered, by including in it the first row in the permutation and the successive $s\% \simeq 25\%$ rows. In this way, the resulting number of elements of the set $S_n$ was $|S_n| = 30$. Next, for each row in $S_n$, its elements belonging to all the groups except group "0" were obscured independently with probability $p_{missing} = 0.3$.
The (indices of the) remaining entries of the matrix $A$ (excluding the ones belonging to the group "0") formed the training set (denoted by $\Omega^{tr}_n$). The obscured entries in one of the $|S_n|$ rows (say, row $h \in \{1, \ldots, |S_n|\}$) formed the test set (denoted by $\Omega^{test}_{n,h}$), whereas the obscured entries in the remaining $|S_n| - 1$ rows formed the validation set (denoted by $\Omega^{val}_{n,h}$).

3. For each repetition $n$, the generation of the validation and test sets from the set $S_n$ was made $|S_n|$ times, each time with a different selection of the row $h$ associated with the test set (and, as a consequence, also of the $|S_n| - 1$ rows associated with the validation set). Hence, the same training set was associated with $|S_n|$ different pairs of validation and test sets (footnote 3). In this way, for each choice of $S_n$ and of the regularization parameter $\lambda$, the MC optimization problem (2) was solved once instead of $|S_n|$ times, thus improving the computational efficiency. Finally, by construction, each time there was no overlap between the training, validation, and test sets.

4. To avoid overfitting, for each choice of the training set $\Omega^{tr}_n$, the optimization problem (2) was solved for 30 choices $\lambda_k$ for $\lambda$, exponentially distributed as $\lambda_k = 2^{(k-1)/2}$ for $k = 1, \ldots, 30$. The resulting completed and post-processed matrix was indicated as $Z^{(n)}_{\lambda_k}$. Then, for each $\lambda_k$ and each of the $|S_n|$ selections of the validation sets associated with the same training set, the Root Mean Square Error (RMSE) of matrix reconstruction on that validation set was computed as

$$RMSE^{val}_{n,h}(\lambda_k) := \sqrt{\frac{1}{|\Omega^{val}_{n,h}|} \sum_{(c,p) \in \Omega^{val}_{n,h}} \left(A_{c,p} - Z^{(n)}_{\lambda_k,c,p}\right)^2} ,$$

then the choice $\lambda_{k^\circ(n,h)}$ minimizing $RMSE^{val}_{n,h}(\lambda_k)$ for $k = 1, \ldots, 30$ was found. Finally, the RMSE of matrix reconstruction on the related test set was computed in correspondence of the so-obtained optimal value $\lambda_{k^\circ(n,h)}$ as

$$RMSE^{test}_{n,h}(\lambda_{k^\circ(n,h)}) := \sqrt{\frac{1}{|\Omega^{test}_{n,h}|} \sum_{(c,p) \in \Omega^{test}_{n,h}} \left(A_{c,p} - Z^{(n)}_{\lambda_{k^\circ(n,h)},c,p}\right)^2} .$$

5. For each choice of $n$ and $h$, the MC predictions contained in the matrix $Z^{(n)}_{\lambda_{k^\circ(n,h)}}$ were used to build a binary classifier.
More precisely, each time an element $A_{c,p}$ of the matrix $A$ was in the test set, such element was attributed to the class 0 (corresponding to the case $0 \leq \mathrm{RCA} < 1$) when its MC prediction from $Z$ was lower than 0, otherwise it was attributed to the class 1 (corresponding to the case $\mathrm{RCA} \geq 1$). Finally, the average classification of the element $A_{c,p}$ (with respect to all the test sets to which that element belonged) was indicated as $A^{(MC)}_{c,p} \in [0, 1]$, whereas its most frequent classification (either 0 or 1) was indicated as $\hat{A}^{(MC)}_{c,p}$. A random assignment between 0 and 1 was made to deal with ties. In the (unlikely) case the element $A_{c,p}$ appeared in none of the test sets (footnote 4), both $A^{(MC)}_{c,p}$ and $\hat{A}^{(MC)}_{c,p}$ were chosen to be equal to 0.

6. A first MC surrogate $M^{(MC)} \in \mathbb{R}^{119 \times 1243}$ of the incidence matrix $M$ was defined as follows: [...] Similarly, a second MC surrogate $\hat{M}^{(MC)} \in \mathbb{R}^{119 \times 1243}$ of the incidence matrix $M$ was defined as follows: [...] $\hat{A}^{(MC)}_{c,p}$, otherwise. [...] was generated.

In order to assess the prediction capability of the binary classifier associated with MC (see Step 5 above), for each row (country) $c$ of $A$, we also computed the false positive rate $fpr_c$ and the false negative rate $fnr_c$ as the average classification error frequency, respectively, of the true negative/true positive examples in all the test sets associated with that row (where the "negative class" refers to the class 0 associated with $0 \leq \mathrm{RCA} < 1$, and the "positive class" to the class 1 associated with $\mathrm{RCA} \geq 1$).

2 The reader is referred to the Appendix for details on how the incidence matrix $M$ is defined, starting from the RCA matrix.

3 The number of repetitions $N = 1000$ and the percentage $s\% \simeq 25\%$ were selected in order to associate each row with the test set a sufficiently large number of times, with high probability. In particular, with these choices, the average number of times each row was associated with the test set was about 250.
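The construction of the training, validation and test sets in one repetition, and the aggregation of the test-set classifications of a single entry, can be sketched as follows. This is only an illustrative sketch, not the implementation used in the paper: the function names (`make_split`, `val_test_pairs`, `aggregate`) are hypothetical, the matrix is a plain list of lists whose entries in group "0" are encoded as the integer 0, and the subset size is passed explicitly.

```python
import random


def make_split(A, subset_size, p_missing, rng):
    """One repetition: pick a row subset S and obscure some of its entries.

    Returns (S, train, obscured), where train is a list of (row, col)
    index pairs and obscured maps each row of S to its obscured pairs.
    Entries equal to 0 (group "0", originally NaN) are never used.
    """
    n_rows, n_cols = len(A), len(A[0])
    perm = list(range(n_rows))
    rng.shuffle(perm)                      # (pseudo)random permutation of rows
    S = perm[:subset_size]
    in_S = set(S)
    obscured = {r: [] for r in S}
    train = []
    for r in range(n_rows):
        for c in range(n_cols):
            if A[r][c] == 0:
                continue
            if r in in_S and rng.random() < p_missing:
                obscured[r].append((r, c))
            else:
                train.append((r, c))
    return S, train, obscured


def val_test_pairs(S, obscured):
    """For each held-out row h: test = obscured entries of h,
    validation = obscured entries of the other rows of S."""
    for h in S:
        test = obscured[h]
        val = [e for r in S if r != h for e in obscured[r]]
        yield h, val, test


def aggregate(votes, rng):
    """Average and most frequent class of a list of 0/1 test-set labels.

    Ties are broken at random; an entry never seen in a test set gets 0.
    """
    if not votes:
        return 0.0, 0
    ones = sum(votes)
    zeros = len(votes) - ones
    if ones == zeros:
        majority = rng.choice([0, 1])
    else:
        majority = 1 if ones > zeros else 0
    return ones / len(votes), majority


rng = random.Random(0)
A = [[1, -2, 0, 3], [4, 0, -1, 2], [-3, 1, 2, 0],
     [2, 2, -4, 1], [0, 1, 1, -1], [3, -2, 4, 4]]
S, train, obscured = make_split(A, subset_size=2, p_missing=0.3, rng=rng)
pairs = list(val_test_pairs(S, obscured))
```

Since the training set does not depend on the held-out row $h$, a single fit per value of $\lambda$ can be reused for all the validation/test pairs produced by `val_test_pairs`, which is the computational saving described in Step 3.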

The Matrix cOmpletion iNdex of Economic complexitY (MONEY)
In this section, we introduce our proposed economic complexity index, called Matrix cOmpletion iNdex of Economic complexitY (MONEY), whose construction is based on MC.
The MONEY index is built starting from the matrix $M^{(MC)}$ introduced in Section 2. It is based on constructing a binary classifier for each country by combining the corresponding row of $M^{(MC)}$ with a threshold, then assessing the performance of the resulting MC classifications at the level of each country. First, for the binary classifier associated with each country, a Receiver Operating Characteristic (ROC) curve (footnote 5), denoted as $ROC_c$, is constructed, based on a country-dependent threshold. The corresponding Area Under the Curve (AUC) (footnote 6) is denoted as $AUC_c$. In more detail, for each country $c$, the elements of the $c$-th row of the matrix $M^{(MC)}$ are compared with a threshold to construct the associated binary classifier. The elements belonging to the same row of the original incidence matrix $M$ are taken as ground truth. The discrimination threshold is varied from 0 to 1, using a step size equal to 0.01. All the elements of $M^{(MC)}$ are used as dataset, except the ones having the same indices as the originally NaN values in the RCA matrix. This yields a binary classifier for each threshold and for each country. The idea now is to exploit the $AUC_c$ of the binary classifiers associated with the countries in order to provide a measure of complexity of such countries, based on the predictability of the corresponding rows. Specifically, countries with lower $AUC_c$ may be considered as more complex, since it is harder for MC to predict their RCA entries. The $AUC_c$ alone, however, does not capture the reasons why MC performed poorly (or, vice versa, adequately). As an example, consider the three following hypothetical scenarios. Assume that MC performs poorly on a country $c$ by attributing $\mathrm{RCA} \geq 1$ to a product $p$ when its true RCA is smaller than 1, and correctly assigns an RCA smaller than 1 to all the other countries for the same product $p$ (Scenario 1).
Consider now the two following similar scenarios in which, for the same product $p$ and the same country $c$, MC performs poorly on the country $c$ by attributing $\mathrm{RCA} \geq 1$ to the product $p$ when its true RCA is smaller than 1, and attributes either correctly (Scenario 2) or incorrectly (Scenario 3) $\mathrm{RCA} \geq 1$ to all the other countries for the same product. It is reasonable to suppose that, all other things being equal, the country $c$ to which MC assigned $\mathrm{RCA} \geq 1$ for the product $p$ in Scenario 1 is more complex than the same country to which MC assigned $\mathrm{RCA} \geq 1$ for the product $p$ in Scenarios 2 and 3. In fact, while in Scenario 2 MC could have been driven to predict, for country $c$, an RCA of $p$ larger than or equal to 1 by the presence of several RCA entries larger than or equal to 1 for the other countries, this is not the case for Scenario 1. Scenario 3 is less likely to occur, since, as shown later in Section 4.1 and in the Supplemental, MC typically has a quite satisfactory prediction capability in its specific application to the RCA matrix. In this case, it is not possible to conclude that country $c$ is more complex than the other countries, since MC is wrongly attributing $\mathrm{RCA} \geq 1$ to $p$ for all such countries. The example above suggests that, by adopting the $AUC_c$ alone as a complexity measure, country $c$ would be classified as equally complex in Scenarios 1, 2 and 3 (assuming the $AUC_c$ to be equal in all these cases). In order to correct for this, we propose a refined complexity measure, based on weighting the $AUC_c$ for each country $c$. The rationale of the proposed complexity measure is that not only are less predictable countries (according to MC) more complex, but one should also take into account the product dimension when comparing the MC predictions obtained for different countries, controlling for the quality of each prediction.
More precisely, it is proposed to associate a weight $w_c$ to each country $c$, which is constructed in such a way that the $AUC_c$'s of countries with a higher share of "rare" false positives are weighted less (since they are less predictable). In more detail, the proposed complexity measure is constructed as follows.
1. First, the MC analysis made for the countries is repeated for the products, still referring to the same year. This is obtained simply by replacing, at the beginning of the analysis, the RCA matrix with its transpose. Analogously, the matrices $\hat{M}$ [...]

[...] being $fpr_{p,t}$ the false positive ratio for the classifications associated with that product (determined by the comparison between $M_t^{(MC)}$ and $M$, restricted to the entries associated with that product) and $\frac{N_p}{N_p + P_p}$ the proportion of entries with true $\mathrm{RCA} < 1$ with respect to all the entries associated with that product (i.e., 119). Besides, the average $f^{tot}_p$ of $f^{tot}_{p,t}$ with respect to $t$ is computed.

4. Then, for each country $c$, the weight $w_c$ is defined as follows: [...] In other words, for each country $c$, the weight $w_c$ is the average of $f^{tot}_p$ with respect to all the products $p$ for which one predicts $\mathrm{RCA} \geq 1$ through the surrogate incidence matrix $(M)^{(MC)}$.

5. Finally, the MONEY index for each country $c$ is computed as: [...]
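The verbal definition of the weight $w_c$ in Step 4 can be illustrated by the following minimal sketch. It takes the per-product averages $f^{tot}_p$ as a given input and does not attempt to reproduce how they are computed. The function name `country_weights` and the choice of returning NaN for a country with no product predicted at $\mathrm{RCA} \geq 1$ are assumptions made only for illustration.

```python
def country_weights(M_hat, f_tot):
    """Average f_tot over the products predicted with RCA >= 1, per country.

    M_hat : binary surrogate incidence matrix (rows: countries).
    f_tot : list with one value per product (taken as given).
    A country with no predicted positive gets NaN (illustrative choice).
    """
    weights = []
    for row in M_hat:
        selected = [f_tot[p] for p, v in enumerate(row) if v == 1]
        if selected:
            weights.append(sum(selected) / len(selected))
        else:
            weights.append(float("nan"))
    return weights


# Toy example: two countries, three products.
M_hat = [[1, 0, 1],
         [0, 0, 0]]
f_tot = [0.2, 0.5, 0.4]
w = country_weights(M_hat, f_tot)
```

In the toy example the first country is predicted to be competitive in products 1 and 3, so its weight is the mean of 0.2 and 0.4.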

Global performance of matrix completion
In the following, the diagnostic ability of MC is illustrated. As in Section 3, the matrix $M^{(MC)}$ was combined with a threshold to construct a binary classifier (in this case, however, differently from Section 3, the threshold did not depend on the country). The discrimination threshold was varied from 0 to 1, using a step size equal to 0.01. All the elements of $M^{(MC)}$ were used as dataset, except the ones having the same indices as the originally NaN values in the RCA matrix. The ground truth was provided by the corresponding elements of the original incidence matrix $M$. Fig. 1 shows the resulting ROC curve. Similarly, $ROC_c$ curves (see Section 3) for a random sample of countries are displayed in Fig. 2. In both figures, the baseline is the line passing through the origin with slope 1.
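The threshold sweep behind the ROC curves can be sketched as follows. This is a minimal illustration under stated assumptions: an entry is classified as positive when its score is greater than or equal to the threshold (the text does not specify whether the inequality is strict), the AUC is obtained with the trapezoidal rule, and both classes are assumed to be present in the labels. The function name `roc_auc` is hypothetical.

```python
def roc_auc(scores, labels, step=0.01):
    """ROC points and trapezoidal AUC for scores in [0, 1].

    scores : averaged MC classifications (one per retained entry).
    labels : ground-truth 0/1 entries of the incidence matrix.
    Requires at least one positive and one negative label.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    n_steps = round(1 / step)
    points = []
    for i in range(n_steps + 1):
        t = i * step                       # thresholds 0, 0.01, ..., 1
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    points.append((0.0, 0.0))              # threshold above every score
    points.sort()                          # order by false positive rate
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc


# Perfectly separated toy scores.
_, auc_good = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# Perfectly inverted toy scores.
_, auc_bad = roc_auc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
```

Restricting `scores` and `labels` to the entries of a single row gives the country-level curve $ROC_c$, while pooling all retained entries gives the global curve.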
As is evident from Figs. 1 and 2, MC performed quite well on average both globally and for developed countries such as Japan, United States and Germany. Its performance was poorer (though still above the baseline) for countries that either provided less information on their trade flows or whose trade flows were extremely volatile (i.e., they alternated between products with extremely high RCA values and products with very low RCA values). Specifically, $fnr_c$ was higher for the latter countries. Nonetheless, the average performance of MC over all the countries was high, as shown by the AUC reported in Fig. 1, which turned out to be about 0.81 for the binary classifier described in Step 5 of Section 2. As a further check, since the positive and negative labels were unbalanced in the original dataset (specifically, entries with $\mathrm{RCA} < 1$ represented almost 70% of the entire dataset), we also applied the Balanced Accuracy (BACC) index (footnote 7), which turned out to be 0.75. Figs. 3a-3b display the original incidence matrix $M$ as compared to the MC surrogate incidence matrix $M^{(MC)}$ obtained at the HS-4 level of product aggregation. The two matrices display similar but not identical entries. On the one hand, their similarity confirms the good MC prediction performance at a global level. On the other hand, their differences could be attributed to the high complexity of specific country/product pairs being predicted. In other words, there may be a discrepancy between the actual RCA value of a country/product pair and its potential RCA value, predicted by MC on the basis of similar country/product pairs.
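To see why balanced accuracy is a useful check on unbalanced labels, the following minimal sketch uses its usual definition as the mean of the true positive rate and the true negative rate (assumed here, since the footnote defining the index is not reproduced in this section). The function name `rates` is hypothetical, and both classes are assumed to be present.

```python
def rates(pred, truth):
    """False positive rate, false negative rate and balanced accuracy.

    pred, truth : lists of 0/1 labels of equal length.
    Requires at least one positive and one negative in truth.
    """
    tp = sum(1 for p, y in zip(pred, truth) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(pred, truth) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(pred, truth) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, truth) if p == 0 and y == 1)
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    bacc = 0.5 * ((1 - fpr) + (1 - fnr))   # mean of TNR and TPR
    return fpr, fnr, bacc


# A trivial classifier that always predicts the majority (negative) class.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
pred = [0] * 10
fpr, fnr, bacc = rates(pred, truth)
accuracy = sum(1 for p, y in zip(pred, truth) if p == y) / len(truth)
```

In this toy case the plain accuracy is 0.8 even though no positive entry is ever detected, whereas the balanced accuracy is 0.5, the value of an uninformative classifier.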

Results related to the MONEY index
In this section we report the ranking of countries in terms of economic complexity as expressed by the MONEY index introduced in Section 3. In particular, we represent the countries according to their MONEY index ( Fig. 4a), then we compare the obtained ranking with the one expressed by GENEPY (Fig.4b). In Fig.4a, countries are colored according to their MONEY values (normalized between 0 and 1), which are proportional to the shade of blue. In particular, the color map ranges from the least complex countries c (colored in white) to the most complex ones (colored in dark blue).  It is worth observing that both the GENEPY and the proposed MONEY index arise from the attempt to reconstruct (in a different way for each method) a matrix related to trade flows. In the case of GENEPY, the matrix is a proximity matrix N derived from the incidence matrix M (see the Appendix for the definition of the matrix N), and its reconstruction is obtained as a nonlinear least-square estimate based on the components of the first two (normalized) eigenvectors of that matrix. Then, a successive evaluation on how the quality of the estimate changes by dropping specific components of such eigenvectors (the ones associated with a given country) is made. In our case, the matrix A is obtained as a discretization of the RCA matrix. Then, MC is applied several times to the matrix A to reconstruct a portion of that matrix which has been obscured, in the attempt to uncover a "latent" similarity between countries, which can be useful for the prediction of whether their RCA entries are lower than 1, or higher than or equal to 1. Another difference is that the matrix reconstruction on which GENEPY is based relies only on two eigenvectors of N, whereas our method, being also based on MC, exploits a typically much larger number of left-singular/right-singular vectors to build the reconstructed matrix, for each application of MC. 
The choice of the number of such pairs is made automatically by the adopted validation procedure. Moreover, a final evaluation of the quality of the reconstruction is made, by considering several test sets, on which the $AUC_c$'s are based. A further quality assessment is provided by Tab. 1, which reports the number of G19+5 (footnote 8) [...] value based on its surrogate $\hat{M}^{(MC)}$ are quite similar (see Fig. 5b for their difference). Hence, they provide analogous results in terms of the complexity of the countries, confirming the satisfactory prediction capability of MC for the specific learning task. Nevertheless, one can also notice that the two complexities differ in some countries. Such differences may be ascribed to surpluses/deficits of the actual complexities of such countries (i.e., the ones measured by GENEPY based on the original incidence matrix $M$) with respect to the respective predicted complexities (i.e., the ones measured by GENEPY$^{(MC)}$, which is based on the surrogate incidence matrix $\hat{M}^{(MC)}$). To quantify the correlation between the GENEPY rankings computed based on $M$ and $\hat{M}^{(MC)}$, respectively, we evaluated their Kendall rank correlation coefficient $\tau_k$. The statistical test produced $\tau_k \simeq 0.8$ with a p-value near 0, significantly rejecting the null hypothesis of independence between GENEPY and GENEPY$^{(MC)}$. It is worth noticing that, with a few exceptions (China, France, Italy, UK and Germany), the more complex the country according to GENEPY, the higher the difference between GENEPY and GENEPY$^{(MC)}$. Finally, Fig. 6 displays the false positive rate $fpr_c$ for each country considered in the analysis, which turned out to produce a ranking of countries quite similar to the one generated by GENEPY ($\tau_k = 0.75$).

8 In the table we considered countries within the G20. However, since the G20 comprises the EU (except France, Italy and Germany, which are accounted for separately), which is an agglomerate of countries, we considered a group of 5 representative countries for the EU, namely: Spain, Switzerland, Greece, Denmark and Hungary.
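For readers unfamiliar with the Kendall rank correlation coefficient used to compare the rankings, a minimal sketch follows. It implements the tau-a variant, which assumes no ties in either ranking, and it does not compute the p-value of the associated test. Which variant was used in the analysis is not stated here, so this choice and the function name `kendall_tau` are assumptions.

```python
def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    x, y : two lists of scores for the same items (no ties assumed).
    """
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Two rankings of four items that disagree on a single pair.
tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])
```

With four items there are six pairs; five are ordered the same way in both rankings and one is swapped, so the coefficient is (5 - 1) / 6.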

Discussion
In the present work, we applied Matrix Completion (MC) to investigate in various ways the economic complexity of countries. First, we found the MC predictions to be quite accurate when MC was applied to reconstruct the Revealed Comparative Advantage (RCA) matrix, which is at the basis of the construction of several existing economic complexity indices (see the Appendix). Then, we proposed the Matrix cOmpletion iNdex of Economic complexitY (MONEY), based on the predictability of the RCA entries associated with different countries. As an additional contribution, we combined MC with a recently-developed economic complexity index (GENEPY), to assess the expected economic complexity of countries. In the work, MC was exploited to infer the expected discretized RCA of a country $c$ in a certain class of goods or services $p$. The MC technique employed is based on a soft-thresholded SVD. This, combined with the MC validation phase, allows one to select automatically a suitable number of singular vectors to be used to reconstruct the discretized RCA matrix. In this way, differently from previous economic complexity indices, the information extracted is not restricted to the first two singular vectors. The results of our analysis highlighted a generally quite good performance of MC in discerning country-product pairs with RCA values greater than or equal to the critical threshold of 1, which denotes the competitiveness of $c$ in producing $p$. The outcomes were summarized by reporting the global ROC curve and comparing the heat-map of the true incidence matrix $M$ with that of its MC surrogate matrix $\hat{M}^{(MC)}$, which was obtained from various applications of MC. Motivated by the high MC accuracy, we developed the MONEY index taking into account both the predictive performance of MC for each country (as measured by its $AUC_c$) and the product dimension.
In other words, when constructing that index, each $AUC_c$ was weighted by the average of the $f^{tot}_p$'s with respect to a subset of products associated with the specific country. As a further step, we applied the GENEPY algorithm first to the incidence matrix derived directly from the original RCA matrix, then to the MC surrogate incidence matrix $\hat{M}^{(MC)}$. This allowed us to directly compare the values of the two GENEPY indices, thus assessing their potential discrepancies. On average, such discrepancies were higher for countries that are more complex according to the original GENEPY index.

Matrix completion via nuclear norm regularization of the reconstruction error
Given a subset of observed entries of a matrix $A \in \mathbb{R}^{C \times P}$, Matrix Completion (MC) works by finding a suitable low-rank approximation (say, with rank $R$) of $A$, by assuming the following model:

$$A = C G^\top + W , \qquad (10)$$

where $C \in \mathbb{R}^{C \times R}$, $G \in \mathbb{R}^{P \times R}$, whereas $W \in \mathbb{R}^{C \times P}$ is a matrix of modeling errors. The rank-$R$ approximating matrix $C G^\top$ is found by solving a suitable optimization problem (provided by Eq. (2), in the case of the present article). Eq. (10) can be written element-wise as $A_{c,p} = \sum_{r=1}^{R} C_{c,r} G_{p,r} + W_{c,p}$. A common interpretation of this equation is as follows (see, e.g., the application of MC to collaborative filtering for movie ratings [7]). The number $C_{c,r}$ can be interpreted as the degree of membership of row $c$ of matrix $A$ to some "latent" cluster $r$ (for a total of $R$ such clusters), and $G_{p,r}$ as the prediction of an element in column $p$ of matrix $A$, conditioned on its row $c$ belonging to cluster $r$. It is worth mentioning that such an interpretation holds regardless of the signs of the elements $C_{c,r}$ and $G_{p,r}$. As an example, in the case of collaborative filtering for movie ratings, $c$ denotes a specific person, $p$ a specific movie, whereas $r$ may be interpreted as a specific movie genre. In this work, MC is formulated via the optimization problem (2). The objective function of this optimization problem is the sum of two terms: the first one refers to the reconstruction error of the known portion of the matrix, whereas the second one is a regularization term, which biases the reconstructed matrix to have a small nuclear norm. The regularization constant $\lambda$ controls the trade-off between fitting the known entries of the matrix $A$ and achieving a small nuclear norm. The latter requirement is often related to getting a low rank of the optimal matrix $Z^\circ$, which follows by geometric arguments similar to the ones typically adopted to justify how the classical LASSO (Least Absolute Shrinkage and Selection Operator) penalty term achieves effective feature selection in linear regression [17].

The MC optimization problem (2) can also be written as

$$\underset{Z \in \mathbb{R}^{C \times P}}{\mathrm{minimize}} \;\; \frac{1}{2} \left\| P_{\Omega^{tr}}(A) - P_{\Omega^{tr}}(Z) \right\|_F^2 + \lambda \|Z\|_* , \qquad (11)$$

where, for a matrix $Y \in \mathbb{R}^{C \times P}$, $(P_{\Omega^{tr}}(Y))_{c,p} := Y_{c,p}$ if $(c, p) \in \Omega^{tr}$, otherwise it is equal to 0. Here, $P_{\Omega^{tr}}(Y)$ represents the projection of $Y$ onto the set of positions of observed entries of the matrix $A$, and $\|Y\|_F$ denotes the Frobenius norm of $Y$ (i.e., the square root of the sum of the squares of all its entries). The MC optimization problem (11) can be solved by applying the following Algorithm 1, named Soft Impute [13] (compared to the original version, here we have included a maximal number of iterations $N_{it}$, which can be helpful to reduce the computational effort when one has to run the algorithm multiple times, e.g., for several choices of the training set $\Omega^{tr}$ and of the regularization constant $\lambda$, as in the present work):

Algorithm 1: Soft Impute [13]
Input: Partially observed matrix $P_{\Omega^{tr}}(A)$, regularization constant $\lambda \geq 0$, tolerance $\varepsilon \geq 0$, maximal number of iterations $N_{it}$
Output: Completed matrix $Z_\lambda \in \mathbb{R}^{C \times P}$
1. Initialize $Z$ as $Z^{old} = 0 \in \mathbb{R}^{C \times P}$
2. Repeat for at most $N_{it}$ iterations:
   (a) Set $Z^{new} \leftarrow S_\lambda\left(P_{\Omega^{tr}}(A) + P^\perp_{\Omega^{tr}}(Z^{old})\right)$
   (b) If $\|Z^{new} - Z^{old}\|_F^2 / \|Z^{old}\|_F^2 \leq \varepsilon$, exit
   (c) Set $Z^{old} \leftarrow Z^{new}$
3. Set $Z_\lambda \leftarrow Z^{new}$

In Algorithm 1, for a matrix $Y \in \mathbb{R}^{C \times P}$, $P^\perp_{\Omega^{tr}}(Y)$ represents the projection of $Y$ onto the complement of $\Omega^{tr}$, whereas $S_\lambda(Y) := U \Sigma_\lambda V^\top$, being $Y = U \Sigma V^\top$ (with $\Sigma = \mathrm{diag}[\sigma_1, \ldots, \sigma_R]$) the singular value decomposition of $Y$, and $\Sigma_\lambda := \mathrm{diag}[(\sigma_1 - \lambda)_+, \ldots, (\sigma_R - \lambda)_+]$, with $t_+ := \max(t, 0)$. It is worth mentioning that a particularly efficient implementation of the operator $S_\lambda(\cdot)$ is possible (by means of the MATLAB function svt.m [11]), which is based on the determination of only the singular values $\sigma_i$ of $Y$ that are higher than $\lambda$, and of their corresponding left-singular vectors $u_i$ and right-singular vectors $v_i$. Indeed, all the other singular values of $Y$ are annihilated in $\Sigma_\lambda$. A final remark has to be made about the trade-off between prediction capability and biasedness of MC.
Biasedness in MC depends, among other issues, on the way the selection of unobserved entries is made [5,12] (in the specific case of our application of MC to the discretized RCA matrix, only entries belonging to a suitable subset of rows of the matrix $A$ are obscured). For some MC algorithms, de-biasing is possible [5], and can even improve prediction capability. Nevertheless, in general biasedness can be beneficial to prediction capability, due to the well-known trade-off between bias and variance [6]. In the particular case of MC achieved via the Soft Impute algorithm, biasedness can also be ascribed to the presence of the regularization constant $\lambda$ (indeed, for both $\lambda \to 0$ and $\lambda \to +\infty$, the predictions of the optimal solution to the optimization problem (11) tend to 0 for the unobserved entries), and to the fact that the Soft Impute algorithm is initialized by a matrix with all entries equal to 0, and terminated at most after a given number of iterations.
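A compact sketch of the Soft Impute iteration is given below. It is an illustration rather than the implementation used in this work (which relies on MATLAB and on svt.m): it assumes NumPy is available, computes a full SVD at each iteration instead of a truncated one, and uses a relative-change stopping rule written so as to avoid a division by zero at the first iteration. The names `soft_threshold_svd` and `soft_impute` are hypothetical.

```python
import numpy as np


def soft_threshold_svd(Y, lam):
    """S_lambda(Y): shrink every singular value of Y by lam, floor at 0."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s = np.maximum(s - lam, 0.0)
    return (U * s) @ Vt


def soft_impute(A, mask, lam, tol=1e-9, max_iter=1500):
    """Complete A from its observed entries (mask == True).

    Each iteration fills the unobserved entries with the current
    estimate, then applies the soft-thresholded SVD.
    """
    Z_old = np.zeros(A.shape, dtype=float)
    for _ in range(max_iter):
        filled = np.where(mask, A, Z_old)      # P(A) + P_perp(Z_old)
        Z_new = soft_threshold_svd(filled, lam)
        num = np.linalg.norm(Z_new - Z_old) ** 2
        den = np.linalg.norm(Z_old) ** 2
        Z_old = Z_new
        if num <= tol * den:                   # relative change small enough
            break
    return Z_old


# Toy rank-1 matrix with one unobserved entry.
A = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])
mask = np.ones(A.shape, dtype=bool)
mask[2, 1] = False
Z = soft_impute(A, mask, lam=0.01)
```

With a fully observed matrix and $\lambda = 0$ the iteration returns the matrix itself, while a very large $\lambda$ annihilates every singular value and returns the zero matrix.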

Technical details on the construction of the matrix A and on the application of the Soft Impute algorithm
This subsection details the construction of the matrix $A$ for our specific problem. As a first step, we removed from the RCA matrix its rows associated with countries having fewer than 5 million inhabitants. Then, the remaining entries of the RCA matrix were encoded into 9 groups according to increasing percentiles in the distribution of RCA values. This pre-processing step was done in order to make the elements of the resulting matrix $A \in \mathbb{R}^{119 \times 1243}$ of the same order of magnitude. In particular, we defined 4 negative groups ("-4", "-3", "-2", "-1"), representing the case $0 \leq \mathrm{RCA} < 1$ (with the group "-4" being the one associated with the lowest values in the RCA distribution) and 4 positive groups ("1", "2", "3", "4"), representing the case $\mathrm{RCA} \geq 1$ (with the group "4" being the one associated with the highest values in the RCA distribution) (footnote 9). Originally NaN RCA values were included in the remaining group "0". In our application of MC, the elements in this group "0" were included neither in the training set, nor in the validation/test set, since no ground truth was available for them. For computational efficiency reasons, we combined the original MATLAB implementation of Soft Impute [13] with the MATLAB function svt.m [11]. The tolerance of the algorithm was chosen as $\varepsilon = 10^{-9}$. Its maximal number of iterations was set to $N_{it} = 1500$. The regularization parameter $\lambda$ was sampled 30 times uniformly on the closed interval $[-1, 15]$ in a logarithmic scale with base 2. A post-processing step was included in MC, thresholding to $-4$ any element (when present) whose MC reconstruction was lower than $-4$, and to 4 any element (when present) whose MC reconstruction was higher than 4.
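The encoding into 9 groups, the grid of regularization constants and the post-processing step can be sketched as follows. The exact percentiles used for the groups are not stated in the text, so the sketch assumes four equal-frequency bins on each side of the threshold 1. The grid follows the formula $\lambda_k = 2^{(k-1)/2}$ reported in the description of the procedure. All function names are hypothetical.

```python
import math


def quartile_group(value, sorted_vals):
    """Index 1..4 of the equal-frequency bin of value within sorted_vals."""
    rank = sum(1 for v in sorted_vals if v <= value)      # 1..n
    return min(4, 1 + (4 * (rank - 1)) // len(sorted_vals))


def discretize(rca):
    """Encode RCA values into groups -4..-1 (RCA < 1), 1..4 (RCA >= 1), 0 (NaN)."""
    flat = [v for row in rca for v in row if not math.isnan(v)]
    low = sorted(v for v in flat if v < 1)
    high = sorted(v for v in flat if v >= 1)

    def code(v):
        if math.isnan(v):
            return 0
        if v < 1:
            return quartile_group(v, low) - 5             # lowest values -> -4
        return quartile_group(v, high)                    # highest values -> 4

    return [[code(v) for v in row] for row in rca]


def clip(z, lo=-4.0, hi=4.0):
    """Post-processing: threshold a reconstructed value to [-4, 4]."""
    return max(lo, min(hi, z))


# Grid of regularization constants, lambda_k = 2^((k-1)/2), k = 1..30.
lambdas = [2 ** ((k - 1) / 2) for k in range(1, 31)]

rca = [[0.1, 0.2, 0.3, 0.4],
       [1.5, 2.0, 3.0, float("nan")]]
A = discretize(rca)
```

Because the groups are symmetric around zero, the sign of a reconstructed entry can be read directly as the predicted class ($\mathrm{RCA} < 1$ or $\mathrm{RCA} \geq 1$).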

Generalized economic complexity index and related economic complexity indices
The GENeralised Economic comPlexitY (GENEPY) index is a recently-introduced economic complexity index 15, which can be applied to assess the complexity of both countries and products. It is based on a multidimensional representation of their complexity, which makes it possible to combine, in a single index, the different features of some previously-developed one-dimensional economic complexity indices: the Fitness (F) for countries and Quality (Q) for products, both computed by the Fitness and Complexity (FC) algorithm 16, and the Economic Complexity Index (ECI) for countries and Product Complexity Index (PCI) for products, both obtained by the earlier Method of Reflections (MR) 9. Each of the latter methods is typically able, indeed, to capture only a specific aspect of economic complexity: for instance, when applied to countries, FC is mainly related to the degree of diversification of the export basket of each country, while MR essentially captures the similarities in the export baskets of the different countries 15. The GENEPY index arises from the first two (normalized) eigenvectors (with the eigenvalues ordered in a weakly decreasing way) of a suitable symmetric proximity matrix, which is derived from an incidence matrix $M \in \mathbb{R}^{C \times P}$ obtained by thresholding and binarizing the matrix of Revealed Comparative Advantage (RCA) values. The two eigenvectors capture, respectively, information obtained by the FC method and the MR one. The GENEPY index for countries is obtained in the following way (a similar construction holds for the GENEPY index for products).
1. First, for a specific year, the matrix $\mathrm{RCA} \in \mathbb{R}^{C \times P}$ of RCA values in that year is determined (see the Introduction for details).
2. In order to extract topological information from the RCA matrix, an incidence matrix $M \in \mathbb{R}^{C \times P}$ is generated, whose entries are defined as follows:
$$M_{c,p} := \begin{cases} 1 & \text{if } \mathrm{RCA}_{c,p} \geq 1, \\ 0 & \text{otherwise.} \end{cases}$$
Then, its weighted version $W \in \mathbb{R}^{C \times P}$ is considered, whose generic element is defined as $W_{c,p} := \frac{M_{c,p}}{k_c k'_p}$, where $k_c := \sum_{p=1}^{P} M_{c,p}$ is the degree of the country c in the graph represented by the incidence matrix M, and $k'_p := \sum_{c=1}^{C} \frac{M_{c,p}}{k_c}$ represents the degree of the product p corrected by how easily that product is found within the subnetwork of countries.
3. The matrix $N \in \mathbb{R}^{C \times C}$ is constructed, whose elements $N_{c,c^*}$ are defined as follows:
$$N_{c,c^*} := \begin{cases} \sum_{p=1}^{P} W_{c,p} W_{c^*,p} & \text{if } c \neq c^*, \\ 0 & \text{if } c = c^*. \end{cases} \qquad (13)$$
Due to the weighting involved in the construction of the matrix W, the resulting matrix N is symmetric. Each entry $N_{c,c^*}$ of N represents the proximity of the two corresponding countries c and $c^*$.
4. The (normalized) eigenvectors $x_1, x_2 \in \mathbb{R}^{C}$ associated with the two largest eigenvalues $\lambda_1 \geq \lambda_2 \geq 0$ of N are determined. Their components are denoted as $x_{c,1}$ and $x_{c,2}$, respectively, for c = 1, . . . , C.
5. Then, the GENEPY index of country c for the specific year, $\mathrm{GENEPY}_c$, is defined by Eq. (14) as a nonlinear function of $x_{c,1}$ and $x_{c,2}$.

The specific nonlinear transformation from $x_{c,1}$ and $x_{c,2}$ to $\mathrm{GENEPY}_c$, which is used in Eq. (14), can be justified by rigorous statistical arguments, based on the use of the two (normalized) eigenvectors $x_1$ and $x_2$ to get a nonlinear least-square estimate of the matrix N, and on the evaluation of how relevant $x_{c,1}$ and $x_{c,2}$ are to obtain that estimate 14,15. It is worth mentioning the qualitative difference between the GENEPY index and the ones determined by the FC and MR methods, considering again the case in which they are all applied to countries.

9 As already reported in the Appendix, MC is biased towards low absolute values. To take this into account, we constructed groups that were symmetrically distributed around zero, as the final goal was to discriminate between RCA values respectively lower than 1, and larger than or equal to 1.
• The GENEPY index is highly related to a linearized version of the F index computed by the FC method 15, in which one searches for the (normalized) eigenvector associated with the largest eigenvalue of a matrix $N_F \in \mathbb{R}^{C \times C}$ that is slightly different from the matrix N. The specific matrix $N_F$ is written as $N_F := W W^\top$, where W is the same weighted incidence matrix considered in the context of the GENEPY index. The difference with respect to the matrix N defined in Eq. (13) is that the diagonal entries of $N_F$ are not set to 0 (in Eq. (13), the diagonal entries are set to 0 in order to make the resulting N a proximity matrix).
• The ECI index, computed by MR, is based on searching for the (normalized) eigenvector associated with the second-largest eigenvalue of a matrix $N_{\mathrm{ECI}} \in \mathbb{R}^{C \times C}$ that is slightly different from the matrix N considered by GENEPY. The specific matrix $N_{\mathrm{ECI}}$ is written as $N_{\mathrm{ECI}} := W_{\mathrm{ECI}} W_{\mathrm{ECI}}^\top$, where the elements of $W_{\mathrm{ECI}} \in \mathbb{R}^{C \times P}$ are defined as $W_{\mathrm{ECI},c,p} := \frac{M_{c,p}}{k_c k_p}$, with $k_p := \sum_{c=1}^{C} M_{c,p}$ being the degree of the product p in the graph represented by the incidence matrix M. The second-largest eigenvalue of $N_{\mathrm{ECI}}$ is considered, instead of the largest one, since one can show that the (normalized) eigenvector associated with the latter is non-informative for the specific matrix $N_{\mathrm{ECI}}$.

Similar comments hold for the case of the FC and MR methods when they are applied to products (obtaining, respectively, the Q index and the PCI index).
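Steps 1 to 4 of the construction above can be sketched in a few lines. The sketch stops at the two leading eigenvectors and does not evaluate the final transformation of Eq. (14). It assumes that NaN RCA values are treated as 0 when binarizing, and that every country and every product has at least one entry with RCA ≥ 1. The function name is hypothetical.

```python
import numpy as np

def proximity_and_eigenvectors(rca):
    """Return M, W, N and the two leading eigenpairs of N (countries as rows)."""
    rca = np.asarray(rca, dtype=float)
    # Step 2: binarize at RCA >= 1 (assumption: NaN entries are treated as 0).
    M = (np.nan_to_num(rca, nan=0.0) >= 1).astype(float)
    k_c = M.sum(axis=1)                              # degree of each country
    if np.any(k_c == 0) or np.any(M.sum(axis=0) == 0):
        raise ValueError("each country and product needs at least one RCA >= 1")
    k_p_corr = (M / k_c[:, None]).sum(axis=0)        # corrected degree of each product
    W = M / (k_c[:, None] * k_p_corr[None, :])       # weighted incidence matrix
    # Step 3: proximity matrix, i.e. W W^T with the diagonal set to 0.
    N = W @ W.T
    np.fill_diagonal(N, 0.0)
    # Step 4: eigenpairs of the symmetric matrix N, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(N)             # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:2]
    lams, X = eigvals[order], eigvecs[:, order]
    # Eigenvectors are defined up to sign; make each column sum non-negative.
    X = X * np.where(X.sum(axis=0) < 0, -1.0, 1.0)
    return M, W, N, lams, X

# Hypothetical 3x3 RCA matrix, used only to exercise the function.
rca_toy = [[2.0, 0.5, 1.5],
           [0.2, 3.0, 1.1],
           [1.2, 1.3, 0.4]]
M_toy, W_toy, N_toy, lams_toy, X_toy = proximity_and_eigenvectors(rca_toy)
```

The matrix $N_F$ of the first bullet is obtained from the same code by skipping the call that zeroes the diagonal.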

Supplemental MONEY index for the years 2005 and 2014
The present section provides robustness checks on the MONEY index. In particular, the whole analysis has been repeated for the years 2005 and 2014. The figures display similar results to those obtained in the main analysis. However, some differences emerge. Specifically, Russia and East Asia appear to be more complex in 2005 and 2014 than in 2018. These economies were indeed developing rapidly in those years, and their growth rate was higher than that of the EU, which was slowed down by the 2008 financial crisis. The crisis also had an impact on the USA, whose MONEY index was in fact lower in 2014 than in 2005. In the subsequent years, however, the EU and the USA recovered from the crisis, and their complexity, as measured by the MONEY index, rose accordingly in 2018.
Fig. 8 reports, for the product aggregation level HS-2, results similar to those obtained in Section 4.3 of the main text for the HS-4 level. For the sake of completeness, we also report results for the false negative rate. Fig. 9 provides similar results, restricted to the countries for which both the false negative rate and the false positive rate are lower than 0.5. Additionally, Tab. 2 reports, for the product aggregation level HS-2, the Kendall correlation coefficients $\tau_k$ between the ranking produced using GENEPY and the ones produced using either $f^{\mathrm{nr}}_{c,\mathrm{hs-2}}$ or $f^{\mathrm{pr}}_{c,\mathrm{hs-2}}$.
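As a reminder of how the per-country rates can be obtained, the following sketch computes them from a true and a predicted binary incidence matrix. It assumes the usual definitions FN/(FN + TP) and FP/(FP + TN), with "positive" meaning RCA ≥ 1. It also assumes that entries without ground truth are marked with `None` and skipped. The function name and the toy matrices are hypothetical.

```python
def error_rates_per_row(M_true, M_pred):
    """Return two lists (false negative rates, false positive rates), one value per row."""
    fnr, fpr = [], []
    for true_row, pred_row in zip(M_true, M_pred):
        tp = fn = fp = tn = 0
        for t, p in zip(true_row, pred_row):
            if t is None:          # no ground truth (originally NaN RCA): skip
                continue
            if t == 1 and p == 1:
                tp += 1
            elif t == 1:
                fn += 1
            elif p == 1:
                fp += 1
            else:
                tn += 1
        # A rate is undefined when its denominator is empty.
        fnr.append(fn / (fn + tp) if fn + tp else float("nan"))
        fpr.append(fp / (fp + tn) if fp + tn else float("nan"))
    return fnr, fpr

# Two hypothetical countries and four hypothetical products.
M_true_toy = [[1, 1, 0, 0], [1, 0, 0, None]]
M_pred_toy = [[1, 0, 1, 0], [1, 0, 0, 1]]
fnr_toy, fpr_toy = error_rates_per_row(M_true_toy, M_pred_toy)
```

Applying the same function to the transposed matrices gives the corresponding rates per product.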

Application of the second part of the analysis to countries for the year 2018, with products aggregated at the HS-2 level
                                        GENEPY ($\tau_k$)    GENEPY (p-value)
$f^{\mathrm{nr}}_{c,\mathrm{hs-2}}$     0.1230               0.0575
$f^{\mathrm{pr}}_{c,\mathrm{hs-2}}$     0.6476               0.0000

Table 2. Kendall rank correlation coefficients $\tau_k$ and corresponding p-values for the 2018 ranking of countries based on the HS-2 level of aggregation and produced using GENEPY against the 2018 rankings produced respectively by $f^{\mathrm{nr}}_{c}$ and $f^{\mathrm{pr}}_{c}$.
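For reference, a Kendall rank correlation coefficient between two score vectors can be computed as in the sketch below. The text does not state which variant of the coefficient was used, so the sketch assumes the tie-corrected version (often called tau-b) and does not compute p-values. The function name is hypothetical.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b) between two equally long score lists."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                      # tied in both: excluded from all counts
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom

# Hypothetical scores for four countries.
tau_same = kendall_tau_b([1, 2, 3, 4], [10, 20, 30, 40])
tau_reversed = kendall_tau_b([1, 2, 3, 4], [40, 30, 20, 10])
tau_one_swap = kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4])
```

A value close to 1 indicates that the two rankings order the countries in nearly the same way, which is the sense in which the false positive rate is used as a proxy of GENEPY.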
Similarly, Figs. 10a-10b, which refer to products aggregated at the HS-2 level, show results similar to those obtained at the HS-4 level.
Additionally, Figs. 11a-11b report the original incidence matrix M as compared to its MC surrogate $M^{(\mathrm{MC})}$ obtained at the HS-2 level of product aggregation. Also in this case, the two matrices display similar but not identical entries. Thus, conclusions similar to the ones obtained for the HS-4 case apply. However, for the HS-2 case, the percentage of elements in which the two matrices differ is lower.

Countries colored in blue are countries for which $\mathrm{GENEPY} < \mathrm{GENEPY}^{(\mathrm{MC})}$. Countries colored in green are countries for which $\mathrm{GENEPY} \approx \mathrm{GENEPY}^{(\mathrm{MC})}$. Countries colored in red are countries for which $\mathrm{GENEPY} > \mathrm{GENEPY}^{(\mathrm{MC})}$.

Results of the analysis for countries, for the years 2005 and 2014 and at the HS-2 level
In the following, results similar to those obtained in the main text are reported for the years 2005 and 2014. In this case, in order to reduce the computational effort, the analysis was made at the HS-2 level of product aggregation. Fig. 12 reports false negative and false positive rates for the two years, whereas Tab. 4 reports the Kendall correlation coefficients between the ranking produced using GENEPY and the rankings produced respectively by $f^{\mathrm{nr}}_{c,t}$ and $f^{\mathrm{pr}}_{c,t}$, for the years t = 2005 and t = 2014.

Table 4. $\tau_k$ and corresponding p-values for the rankings produced using GENEPY against the rankings produced respectively by $f^{\mathrm{nr}}_{c,t}$ and $f^{\mathrm{pr}}_{c,t}$, for the years t = 2005 and t = 2014, with products aggregated at the HS-2 level.

Application of the analysis to the products at the HS-2 level
The same analysis made in the main text for the countries has been repeated for the products, still referring to the year 2018. This is obtained simply by replacing, at the beginning of the analysis, the RCA matrix with its transpose. Notice that this analysis, like some of the other analyses reported in this Supplemental, was made at the HS-2 level for reasons of computational time.
The results obtained at the HS-2 level, however, show a correlation of 95% with the ones obtained at the HS-4 level. Fig. 13 displays, on the main diagonal of each matrix reported and for a subset of product codes, respectively:
• the false negative rate $f^{\mathrm{nr}}_{p}$ for each product p (Fig. 13a);
• the false positive rate $f^{\mathrm{pr}}_{p}$ for each product p (Fig. 13b);
• the false negative rate $f^{\mathrm{nr}}_{p}$, for the subset of products p for which both $f^{\mathrm{nr}}_{p}$ and $f^{\mathrm{pr}}_{p}$ are lower than 0.5 (Fig. 13c);

• the false positive rate $f^{\mathrm{pr}}_{p}$, for the subset of products p for which both $f^{\mathrm{nr}}_{p}$ and $f^{\mathrm{pr}}_{p}$ are lower than 0.5 (Fig. 13d).
In all these cases, the products have been ordered increasingly with respect to the (either false positive or false negative) rate.
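The selection and ordering used for the four panels can be sketched as follows. The product codes and rate values below are hypothetical, and the function name is an assumption introduced only for illustration.

```python
def select_and_sort(fnr, fpr, threshold=0.5):
    """Keep the items whose two rates are both below the threshold, then sort them.

    fnr, fpr : dictionaries mapping a product code to its rate
    Returns the kept codes ordered by increasing fnr and by increasing fpr.
    """
    kept = [code for code in fnr if fnr[code] < threshold and fpr[code] < threshold]
    by_fnr = sorted(kept, key=lambda code: fnr[code])
    by_fpr = sorted(kept, key=lambda code: fpr[code])
    return by_fnr, by_fpr

# Hypothetical rates for four HS-2 product codes.
fnr_products = {"01": 0.20, "10": 0.60, "52": 0.10, "72": 0.40}
fpr_products = {"01": 0.30, "10": 0.10, "52": 0.45, "72": 0.70}
order_fnr, order_fpr = select_and_sort(fnr_products, fpr_products)
```

Setting `threshold` to a value larger than 1 keeps every product, which corresponds to the first two panels.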

(a)
False negative rate $f^{\mathrm{nr}}_{p}$ for the products, reported as a color shade ranging from blue (associated with the lowest value) to red (associated with the highest value).

(b)
False positive rate $f^{\mathrm{pr}}_{p}$ for the products, reported as a color shade ranging from blue (associated with the lowest value) to red (associated with the highest value).

The correspondence between product codes and their names is reported in Tab. 5. For better readability, the names are reported only for the HS-2 level.

Row index   Product name
1           Live animals
2           Meat and edible meat offal
3           Fish and crustaceans, molluscs and other aquatic invertebrates
4           Dairy produce
5           Products of animal origin, not elsewhere specified or included
6           Live trees and other plants
7           Edible vegetables and certain roots and tubers
8           Edible fruit and nuts
9           Coffee, tea, mate and spices
10          Cereals
11          Products of the milling industry
12          Oil seeds and oleaginous fruits
13          Lac; gums, resins and other vegetable saps and extracts
14          Vegetable plaiting materials
15          Animal or vegetable fats and oils and their cleavage products
16          Preparations of meat, of fish or of crustaceans
17          Sugars and sugar confectionery
18          Cocoa and cocoa preparations
19          Preparations of cereals, flour, starch or milk; pastrycooks' products
20          Preparations of vegetables, fruit, nuts or other parts of plants
21          Miscellaneous edible preparations
22          Beverages, spirits and vinegar
23          Residues and waste from the food industries; prepared animal fodder
24          Tobacco and manufactured tobacco substitutes
25          Salt; sulphur
26          Ores, slag and ash
27          Mineral fuels, mineral oils and products of their distillation; bituminous substances
…           …
…           … and other products of the printing industry
50          Silk
51          Wool, fine or coarse animal hair
52          Cotton
53          Other vegetable textile fibres
54          Sewing thread of man-made filaments
55          Man-made staple fibres
56          Wadding, felt and nonwovens
57          Carpets and other textile floor coverings
58          Special woven fabrics
59          Impregnated, coated, covered or laminated textile fabrics
60          Knitted or crocheted fabrics
61          Articles of apparel and clothing accessories, knitted or crocheted
62          Articles of apparel and clothing accessories, not knitted or crocheted
63          Other made up textile articles
64          Footwear, gaiters and the like; parts of such articles
65          Headgear and parts thereof
66          Umbrellas, sun umbrellas and similar articles
67          Prepared feathers and down and articles made of feathers or of down
68          Articles of stone, plaster, cement, asbestos, mica or similar materials
69          Ceramic products
70          Glass and glassware
71          Natural or cultured pearls, precious or semi-precious stones, precious metals
72          Iron and steel
73          Articles of iron or steel
74          Copper and articles thereof
75          Nickel and articles thereof
76          Aluminium and articles thereof
78          Lead and articles thereof
79          Zinc and articles thereof
80          Tin and articles thereof
81          Other …
…           …
…           Works of art, collectors' pieces, and antiques

Table 5. Product codes and corresponding names at the HS-2 level.

Fig. 14 reports, for the HS-4 level of product aggregation, the confusion matrix built from the normalized GENEPY 10 values associated with the products for the year 2018, computed respectively based on the incidence matrix $M^\top$ and the surrogate incidence matrix $(M^\top)^{(\mathrm{MC})}$, then discretized in 8 classes according to the percentiles of the respective GENEPY distributions. Class 1 corresponds to the lowest GENEPY values, whereas class 8 corresponds to the highest GENEPY values. Notice that the whole procedure applied to countries was repeated here from scratch starting from the transpose of the RCA matrix, focusing in this case on the products. Very similar results ($\tau_k \approx 0.96$) are obtained if, instead, one computes the GENEPY for products using as input the transpose of the surrogate incidence matrix $M^{(\mathrm{MC})}$ obtained in the application of MC to countries, described in the main text. In the figure, the true classes are the ones computed starting from the GENEPY applied to the incidence matrix $M^\top$, whereas the predicted classes are the ones computed starting from the GENEPY applied to the surrogate incidence matrix $(M^\top)^{(\mathrm{MC})}$. Some outliers emerged. The deviations mainly concerned the elements belonging to the higher classes. This is somewhat reasonable, since they refer to more complex products, which might be harder to classify.

10 Since the GENEPY and GENEPY […]
In this case, the Kendall rank correlation coefficient between the GENEPY rankings computed based on $M^\top$ and $(M^\top)^{(\mathrm{MC})}$ turned out to be $\tau_k \approx 0.6$, with a p-value nearly equal to 0. Similarly, we also built a confusion matrix for the HS-2 case. This is reported in Fig. 15.
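The construction of such a confusion matrix can be sketched as follows. The text does not specify how the percentile classes are delimited, so the sketch assumes classes of (nearly) equal size obtained from the rank of each value within its own distribution. The function names and the toy scores are hypothetical.

```python
def percentile_classes(values, n_classes=8):
    """Assign each value a class in 1..n_classes from its rank in the distribution."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, i in enumerate(order):
        # Rank-based binning (assumption): equal-sized classes, lowest values in class 1.
        classes[i] = rank * n_classes // len(values) + 1
    return classes

def confusion_matrix(true_classes, predicted_classes, n_classes=8):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_classes, predicted_classes):
        cm[t - 1][p - 1] += 1
    return cm

# Hypothetical index values for 16 products; the surrogate swaps the two extremes.
genepy_true = [float(v) for v in range(16)]
genepy_surrogate = list(genepy_true)
genepy_surrogate[0], genepy_surrogate[15] = genepy_surrogate[15], genepy_surrogate[0]

cm_toy = confusion_matrix(percentile_classes(genepy_true),
                          percentile_classes(genepy_surrogate))
```

Entries on the main diagonal count the products assigned to the same class by both versions of the index, so off-diagonal mass in the last rows corresponds to the deviations among the most complex products mentioned above.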

Application of a variation of the analysis to countries, based on the entry-wise logarithm of the original RCA matrix for the year 2018, with products aggregated at the HS-4 level
In the following, a variation of our analysis is applied to countries. In this variation, the elements of the original RCA matrix are not discretized, but replaced by their natural logarithm 11. The rest of the proposed method is unchanged with respect to Section 2. In the main text, the discretization method has been preferred, because it generates values more symmetrically distributed around 0. Fig. 16 reports, for this variation of the analysis and for the product aggregation level HS-4, results similar to those obtained in Figs. 8a and 8b for the original analysis and for the product aggregation level HS-2. Similarly, Tab. 6 reports, for this variation of the analysis and for the product aggregation level HS-4, results similar to those obtained in Tab. 4 for the original analysis and for the product aggregation level HS-2.
                                                GENEPY ($\tau_k$)    GENEPY (p-value)
$f^{\mathrm{nr}}_{c,\log,\mathrm{hs-4}}$        -0.2569              0.0000
$f^{\mathrm{pr}}_{c,\log,\mathrm{hs-4}}$        0.5903               0.0000

Table 6. $\tau_k$ and corresponding p-values for the rankings produced using GENEPY against the rankings produced respectively by $f^{\mathrm{nr}}_{c,\log,\mathrm{hs-4}}$ and $f^{\mathrm{pr}}_{c,\log,\mathrm{hs-4}}$, for the product aggregation level HS-4, for the case in which the entry-wise natural logarithm of the original RCA matrix is employed by the proposed method.
Finally, Fig. 17, which refers to this variation of the analysis and to products aggregated at the HS-4 level, also shows results similar to those obtained in Fig. 10b, which refers to the original analysis and to products aggregated at the HS-2 level.

11 No issues arise when taking the natural logarithm, because no 0 or negative entries are present in that matrix. Moreover, entries originally equal to NaN are never included in the training, validation or test sets.

Figure 17. Difference between GENEPY and $\mathrm{GENEPY}^{(\mathrm{MC})}$ for the year 2018, with products aggregated at the HS-4 level, for the case in which the entry-wise natural logarithm of the original RCA matrix is employed by the proposed method.
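The entry-wise transformation used in this variation can be sketched as follows. The function name is hypothetical, and the check on non-positive entries only encodes the assumption stated in footnote 11.

```python
import math

def log_transform(rca):
    """Entry-wise natural logarithm of an RCA matrix given as a list of rows.

    NaN entries are left as NaN, since they are never used for training,
    validation or test.
    """
    out = []
    for row in rca:
        new_row = []
        for value in row:
            if math.isnan(value):
                new_row.append(float("nan"))
            elif value <= 0:
                raise ValueError("RCA entries are assumed to be strictly positive")
            else:
                new_row.append(math.log(value))
        out.append(new_row)
    return out

# Hypothetical RCA row: the threshold RCA = 1 is mapped to 0.
log_row = log_transform([[0.5, 1.0, 2.0, float("nan")]])[0]
```

Under this transformation, the condition RCA ≥ 1 becomes log(RCA) ≥ 0, so the binary classification threshold is 0, as it is for the discretized groups.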