Who Manipulates Data During Pandemics? Evidence from Newcomb-Benford Law

We use the Newcomb-Benford law to test whether countries have manipulated reported data during the COVID-19 pandemic. We find that democratic countries, countries with higher gross domestic product (GDP) per capita, higher healthcare expenditures, and better universal healthcare coverage are less likely to deviate from the Newcomb-Benford law. The relationship holds for the cumulative number of reported deaths and total cases but is more pronounced for the death toll. The findings are robust to second-digit tests, to a sub-sample of countries with regional data, and in relation to the previous swine flu (H1N1) 2009-2010 pandemic. The paper further highlights the importance of independent surveillance data verification projects.


Introduction
On March 11, 2020, the World Health Organization (WHO) declared the novel coronavirus disease 2019 (COVID-19) a pandemic. With tens of millions of confirmed cases and over two million deaths, this pandemic has spurred a great number of controversies, including many related to the accuracy of the data countries report. Mass media organizations around the globe argue that many countries have continued to manipulate the data for political or other gains. 1 In this paper, we study the association between the accuracy of COVID-19 data reported by countries and their macroeconomic and political indicators. Our results show that countries that are more functional democracies, have higher income, and have stronger healthcare systems report more accurate data. The relationship exists for the cumulative number of confirmed cases and for the cumulative number of reported deaths; however, the results are more pronounced for the number of deaths, indicating that less developed countries are more likely to manipulate mortality data.
To gauge data accuracy, we use compliance with the Newcomb-Benford law (NBL), the observation that in many naturally occurring collections of numbers the first digit is not uniformly distributed. The numeral "1" will be the leading digit about 30% of the time; the numeral "2" will be the leading digit about 18% of the time; and each subsequent numeral, "3" through "9," will be the leading digit with decreasing frequency. One property of the NBL is that manipulated or fraudulent data deviate significantly from the theoretical NBL distribution. Because it is simple and straightforward to apply, the law has been used extensively to detect fraud and data manipulation, with applications to accounting, finance, macroeconomic, and forensic data. We apply the NBL to COVID-19 data for 185 countries affected by the pandemic. For each country, we first identify the period of exponential growth, when the data are expected to obey the NBL. After a country's data reach a plateau, the number of cases stabilizes, and the data are no longer expected to obey the NBL. Over the growth period for each country, we calculate four goodness-of-fit measures to estimate compliance with the NBL and use these measures as proxies for data manipulation.
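As an illustration of the pattern (a minimal sketch, not the paper's code), the first digits of an exponentially growing series already approximate the NBL frequencies; the growth rate and horizon below are hypothetical:

```python
import math

def first_digit(x):
    """Return the leading significant digit of a positive integer."""
    while x >= 10:
        x //= 10
    return int(x)

# Theoretical NBL first-digit probabilities: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# A purely exponential series (hypothetical 15% daily growth, 200 days)
series = [int(100 * 1.15 ** t) for t in range(200)]
counts = {d: 0 for d in range(1, 10)}
for x in series:
    counts[first_digit(x)] += 1

# Compare empirical first-digit frequencies with the theoretical ones
for d in range(1, 10):
    print(d, round(counts[d] / len(series), 3), round(benford[d], 3))
```

Once such a series plateaus, the first digit gets stuck on a few values and the fit to the NBL breaks down, which is why the tests below are restricted to each country's growth period.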
We then study the relationship between our proxies for accuracy of data and indicators of the strength of the economy, democratic institutions, and healthcare systems. Specifically, we use regression analysis to find the association between the goodness-of-fit measures and the Economist Intelligence Unit Democracy Index, the gross domestic product (GDP) per capita, healthcare expenditures as a percentage of GDP, and the Universal Health Coverage Index (UHC). Previous studies have shown in other settings that countries with weaker democracies and less economic development are more likely to manipulate data and have lower transparency. 2 The governments of such countries fabricate data for political gains and to consolidate power. Autocratic governments control most mass media outlets and often censor inconvenient facts that undermine the ruling regime. COVID-19 presents such a case because the wide spread of the pandemic and high death tolls would send a negative signal to citizens and indicate that the government is incompetent. Autocratic governments would try to downplay the scale of the pandemic for the sake of appearances.
Our main hypothesis is that countries with weaker democracies, and weaker economic and healthcare systems will have lower data accuracy as measured by the NBL goodness-of-fit statistic.
Our results support the hypothesis in many tests. We find that our four goodness-of-fit measures (which capture deviations from the theoretical distribution given by the NBL) are negatively correlated with the macroeconomic indicators (Democracy Index, GDP per capita, healthcare expenditures, and UHC). The results hold for the cumulative number of cases and the cumulative number of reported deaths. We also find the result is more pronounced for the reported number of deaths than for the number of confirmed cases. This indicates that, on average, autocratic regimes and poorer countries are more prone to manipulate death tolls than the total number of citizens infected.
We conduct a series of robustness tests and find that our results are not driven by the specific period in which we calculate the goodness-of-fit measures, by small countries, by countries with a small number of cases or deaths, or by countries with extreme deviations from the NBL. We also show that the same relationship between proxies for accuracy of data and economic indicators is observed when we apply the NBL to second digits.
One concern of our study is that the proxies for data accuracy are calculated based on limited sample sizes for individual countries. To resolve this potential problem, we confirm our findings for the sub-sample of 50 countries that provide regional data (at a state or province level). Regional data increase the sample size from which we calculate our statistics substantially and heighten the precision of our accuracy measures.
We find a similar relationship for the previous swine flu (H1N1) pandemic of 2009-2010. We repeat our analysis for 35 countries that reported weekly data on the number of confirmed cases and the number of deaths to the Pan American Health Organization (PAHO). We again find support for the negative relationship between deviations from the NBL and the selected developmental indicators.
There is a substantial body of literature assessing countries' tendency to misreport COVID-19 surveillance data using different statistical techniques, such as case fatality rates, excess mortality rates, the variance of reported data, the clustering of data, and even trends in search engines. 3 The inherent problem with these COVID-19 studies is that they rely on uniform approaches to measure confirmed cases and COVID-19-related deaths across countries and across periods within one country. Even though countries are expected to follow the same guidelines provided by the WHO when reporting cases, many variations exist (sometimes across states and regions within a country) in how they collect and report data. Any comparison of raw numbers among countries, like the total number of confirmed cases, the number of deaths, and mortality rates, may also be problematic because such numbers are most likely driven by differences in other variables, like the number of tests conducted, the strength of the healthcare system, demographic composition, and reporting standards. Correct testing would require controlling for all those hard-to-observe variables. This makes comparisons between countries difficult.
One helpful property of the NBL is its tolerance to different data generating processes between observations (in our case, countries). As long as data in each country are expected to obey the NBL, cross-observation comparisons are possible. This means that we can apply the test even if countries differ in how they measure COVID-19 cases and related deaths. The test is also free of country-specific differences, including public policies used to stop the pandemic, like quarantines, social distancing, testing, and availability of treatment. The NBL test is only sensitive to human intervention and manipulation of data in otherwise naturally occurring processes.
Some studies apply the NBL to individual countries to test if a given country has been falsifying data during the pandemic (Idrovo and Manrique-Hernández 2020; Jackson and Sambridge 2020; Peng and Nagata 2020). We argue that such tests are problematic because they largely depend on the sample size. We stress that our paper does not aim to answer whether a particular government manipulates data. We use the same approach to calculate goodness-of-fit measures for all countries and then compare countries cross-sectionally based on developmental indicators. Our results indicate that data from autocracies and poorer countries should be trusted less, in line with the previous literature (Hollyer et al. 2011). It should also be noted that the NBL test is not directional. However, it is unreasonable to believe that a government whose data deviate from the NBL has willingly manipulated the data to inflate the number of cases or deaths. Neither does divergence from the NBL tell us specifically how the data are being manipulated. We cannot ex ante predict which first digits will be over- or under-represented. For example, if a country's true number of cases is in the 2,000s and the government falsifies the data downward into the high 1,900s, the first digit "one" will be over-represented in this country's statistics. However, if the country's numbers are in the 1,000s and the government falsifies them into the 900s, then the first digit "one" will be under-represented.
Our paper contributes to the literature in several ways. First, it helps to resolve the controversy about countries' data manipulation during pandemics and provides estimates of how widespread such manipulation is around the globe. Using the NBL and COVID-19 data, we document that about one-third of the 185 countries affected by the pandemic indeed seem to misreport their data. Second, our study shows which data, if any, countries are more likely to misreport. We document that governments tend to downplay the news with the highest negative impact, i.e., the death toll, to the highest degree. To a slightly lower degree, countries tend to manipulate the total number of confirmed cases. We find no indication, on average, of systematic data manipulation for the number of conducted COVID-19 tests or the number of cured cases. Third, we are the first study to use the NBL to show that the strength of healthcare systems, as measured by healthcare expenditures and the Universal Health Coverage Index, is linked to a government's ability to provide reliable data during pandemics.
Finally, our paper contributes to the political economy and comparative economics literature. We are the first to document the cross-sectional link between macroeconomic and political regime indicators and the tendency to misreport data during pandemics. We show that authoritarian regimes and countries with low GDP per capita are more likely to falsify data. Thus, this study provides additional evidence of the link between democracy and transparency that is often taken for granted. Combined, the results are consistent with previous findings in the literature that authoritarian regimes and poorer countries' governments manipulate information to avoid negative news that may undermine their power.
Our study has broad implications. First, we provide evidence that the data supplied during pandemics may be of low quality, especially from autocracies and poorer countries, and we suggest that caution should be used when interpreting and using the data. Second, the study highlights the importance of initiatives to externally verify data provided by governments, including independent surveillance data verification projects. 4 Finally, we provide new evidence on the applicability of the NBL to detect data fabrication.
The paper is organized as follows: Section 2 reviews the related literature and develops our main hypothesis; Section 3 discusses the NBL of anomalous numbers; Section 4 describes our sample and variables; Section 5 provides the major results, including robustness checks; and Section 6 concludes.

Literature Review and Hypothesis Development
Studies have long posed questions about whether democratic regimes provide more reliable data to the public than autocracies, in both theoretical and empirical settings. For many, the intuitive answer is "yes." However, this depends on the definition of "democracy." If democracy is defined only through electoral competition (e.g., Przeworski et al. 1999; Schumpeter 1942), then the link between data reliability and a particular political regime is not obvious. Some authors argue that the expected relationship is actually the reverse: greater vulnerability to public disapproval within democracies may lead to a higher tendency to falsify data (Kono 2006; Mani and Mukand 2007). Most studies, however, show that democracies indeed are more transparent. They argue that it is authoritarian regimes that are more vulnerable to negative information and have more incentives to distort and manipulate information that undermines their image. In addition, such regimes usually have control over mass media organizations and therefore have more capability to exercise control. Guriev and Treisman (2019) show that modern authoritarian regimes do not use ideological propaganda and political repression to the same extent as dictators in the twentieth century. Instead, in the twenty-first century, information is the key factor in obtaining and retaining power. Authoritarian regimes ground their legitimacy and support from citizens in strong economic performance and successful domestic and foreign affairs. When news that undermines this image is released, it threatens the survival of the autocrat. Such news hurts democratic leaders as well, but democratic regimes depend more on voter welfare, which, in turn, is contingent on available information (Hollyer et al. 2011). Therefore, democracies are more inclined to disclose truthful information. In addition, authoritarian regimes exert much tighter control over information supplied to the public, and as such they have easier ways to distort data.
Autocrats use data manipulation to improve their public image and prolong their stay in office.
Indeed, Gehlbach and Sonin (2014) demonstrate that government media ownership increases media bias. 5 In line with this argument, Rozenas and Stukal (2019) propose that autocrats are more likely to manipulate data for which it is more difficult for citizens to obtain hard external benchmarks. Autocrats manipulate such data with censorship or falsification. Easily verifiable data are merely framed to improve the image of the government. The authors provide examples: citizens can easily benchmark news about income and market prices, so the government resorts to favorable framing when reporting such news; domestic politics and international affairs are hard to verify, so the government more readily falsifies these data. COVID-19 provides a unique setting to test a related hypothesis. Pandemic surveillance data are hard for citizens to acquire independently because they lack access to the necessary large-scale data collection and medical facilities. At the same time, news that the disease is raging and widespread under authoritarian rule would be an indicator of the inefficiency or failure of the government. The death toll is even more damaging to the image of the autocrat, who sees such news as a threat and tries to downplay the scale of the problem. We therefore formulate the following two hypotheses:
Hypothesis 1: Democratic regimes are less likely to manipulate pandemic surveillance data.
Hypothesis 2: The link between democratic regimes and data manipulation is more pronounced for the reported death toll.
Extant empirical studies find support for the hypothesis that democracies provide more reliable data in different settings. For example, Bueno de Mesquita et al. (2003) argue that countries with larger "winning coalitions" (i.e., democracies) are more transparent than countries with small winning coalitions. By analyzing tax revenues and national income data reported by governments, they find support for their hypothesis. Hollyer et al. (2011) use a model to support their similar hypothesis that in democratic regimes, governments are more willing to disclose policy information. Their empirical test is based on the willingness of governments to report data to the World Bank's World Development Indicators. Rozenas and Stukal (2019) use a corpus of daily news reports from Russia's largest state-owned television network, Channel 1, and find that the state-owned media systematically frames facts to make the government look better.
Some authors embrace extreme positions and claim that the reliability of data supplied by the government should itself be a measure of a country's level of democracy, and that elections and pluralism alone are not enough (Dahl 1971). This is because, for elections to be fair, voters should make informed decisions, and informed decisions are only possible in regimes that provide reliable data to voters. Studies thus differentiate between two measures of democracy: a "thin," or minimalist, measure covering only the election process and freedoms; and a "thick" measure that includes more general concepts like transparency and culture. Indeed, if democracy is defined, at least partially, through transparency, then any findings regarding the link between the two will be trivial. Therefore, the measure of democracy should preferably not include the degree of transparency.
We address this issue by constructing several measures of democracy. We start with the widely used Economist Intelligence Unit Democracy Index. We also study its five components: electoral pluralism, functioning of the government, political participation, political culture, and civil liberties. "Stripping" the Democracy Index helps to evaluate the political component of democracy that is not directly related to transparency: while some components are more likely to be related to transparency (like political culture), others are not (like electoral pluralism or the functioning of the government). We also use alternative democracy measures, e.g., the Freedom House Electoral Democracy Index and their broader democracy measure (which includes political freedom and civil liberties). The Freedom House Electoral Democracy Index is the "thin" definition of democracy, and should be unrelated to transparency to avoid spurious correlation.
We also adopt other measures that may explain countries' tendencies to manipulate data. Hollyer et al. (2011) maintain that GDP per capita is a measure of the "ability of the governments to collect and disseminate high-quality statistical data." We therefore include GDP per capita in our tests. Because our setup was created during the pandemic and the testing is done on surveillance data, we use two other proxies for each country's ability to collect and report reliable health-related data: health expenditures as a percentage of GDP, and the Universal Health Coverage Index. We thus formulate our third hypothesis as follows:
Hypothesis 3: Countries with higher GDP per capita and stronger healthcare systems are less likely to manipulate pandemic surveillance data.
To gauge data manipulation, we use compliance with the NBL. We compare countries cross-sectionally to find if there is a relationship between developmental indicators and goodness-of-fit to the NBL. We are not the only paper to use the NBL to test the validity of reported data during COVID-19; several other concurrent studies employ a similar approach. 6 However, these papers usually select one or a few countries and apply the NBL to test if there is any evidence of manipulation in a given country's data. The authors use cutoff values from the chi-squared distribution (or similar distributions) and give a "yes-or-no" type of answer to their binary research question. In many cases, the goodness-of-fit measures are calculated with substantial errors, and many studies do not provide estimates of the statistical significance of the test or its power. In addition, these test statistics and inference results greatly depend on the sample size. With large enough sample sizes, the null hypothesis of compliance with the NBL will be rejected in almost every case. Some studies estimate their test statistic at the country level, some at a regional or state level, and some use county-level data. This leads to contradictory findings among these studies even when looking at the same country.
Any inferences from such tests are also problematic.
Our approach is different. We use all countries affected by COVID-19. For each test, we employ the same approach for all countries to calculate the test statistic. We make inferences from the NBL test only in comparison. We study the link between compliance with the NBL and economic indicators. The unit of observation in our analysis is the country, and it is the relationship between the proxies for data manipulation and the democratic and economic indicators that we find significant in most tests. To the best of our knowledge, this is the first paper to examine the cross-section of all countries and compare them based on developmental indicators when analyzing data manipulation during pandemics.
Not many studies apply the NBL in an international setting, though there are several notable exceptions. Nye and Moul (2007) indicate that international macroeconomic data generally conform to the NBL. They find, however, that for non-OECD (African) countries, the data do not conform to the law, which raises questions about data quality and manipulation in these countries. Gonzalez-Garcia (2009) uses a similar approach to test annual IMF data but finds no connection between independent assessments of data quality and adherence to the first-digit NBL in different country groups. The limitation of these studies, however, is that they group countries based on geographical proximity instead of some logical choice of economic indicators. Michalski and Stoltz (2013) provide a theoretical model and empirical findings that some countries strategically misreport their economic data for short-term government gains. The authors reveal that some groups of countries (i.e., countries with fixed exchange rate regimes, high negative net foreign asset positions, negative current account balances, or greater vulnerability to capital flow reversals) are more likely to falsify macroeconomic data than others. Our paper is different in that it applies the NBL to pandemic data in the international setting. The countries in our study are grouped based on developmental, economic, and political indicators. We contribute to this body of literature by providing additional evidence that some types of countries are more likely to falsify not only macroeconomic data but also surveillance data during pandemics.

Newcomb-Benford Law of Anomalous Numbers
In many naturally occurring processes, the leading significant digit of the resulting data is not uniformly distributed. The distribution is monotonically decreasing, with "1" being the most common first digit and "9" the least common. The law was formally stated by Newcomb (1881) and Benford (1938). A set of numbers is said to follow the NBL if the first digit d occurs with probability P(d) = log10(1 + 1/d). 7 This gives the following probabilities for observing the first and second digits:

Digit     0      1      2      3      4      5      6      7      8      9
First     -    30.1%  17.6%  12.5%   9.7%   7.9%   6.7%   5.8%   5.1%   4.6%
Second  12.0%  11.4%  10.9%  10.4%  10.0%   9.7%   9.3%   9.0%   8.8%   8.5%

The data are expected to follow the NBL when the logarithms of the values are uniformly and randomly distributed. The NBL accurately describes many real-life sets of numerical data, including lengths of rivers, stock prices, street addresses, accounting data, populations, physical constants, and regression coefficients (Diekmann 2007). Data generated from many distributions and integer sequences have been shown to closely obey the NBL, including Fibonacci numbers, powers of numbers, exponential growth, many ratio distributions, and the F-distribution with low degrees of freedom. 8 Not all distributions generate data that follow the law; for example, the uniform distribution, the normal distribution, and square roots of numbers do not obey it. For the data to obey the NBL, several criteria should be satisfied (Cho and Gaines 2007; Diekmann 2007; Durtschi et al. 2004):
• The data span several orders of magnitude and are relatively uniform over those orders
• The mean is greater than the median, and the skewness is positive
• The data result from naturally occurring, multiplicative processes and are not influenced by human intervention
The last requirement, i.e., the fact that human intervention usually generates data that violate the NBL, has led to its usefulness in detecting fraud and data manipulation.
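The first- and second-digit probabilities above can be reproduced directly from the formula. The sketch below (illustrative, not the paper's code) implements the general nth-digit extension by summing over all possible leading prefixes:

```python
import math

def nbl_prob(digit, position=1):
    """Probability that `digit` appears in the given significant-digit
    position (1 = first, 2 = second, ...) under the Newcomb-Benford law."""
    if position == 1:
        if digit == 0:
            return 0.0  # a leading significant digit is never zero
        return math.log10(1 + 1 / digit)
    # For position n >= 2, sum over all possible leading prefixes k
    lo, hi = 10 ** (position - 2), 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * k + digit)) for k in range(lo, hi))

print([round(nbl_prob(d, 1), 3) for d in range(1, 10)])
# first digits: "1" appears about 30.1% of the time, "9" only about 4.6%
print([round(nbl_prob(d, 2), 3) for d in range(10)])
# second digits are much closer to uniform: about 12.0% for "0" down to 8.5% for "9"
```

The flatter second-digit distribution is why second-digit tests (used as a robustness check in this paper) are a weaker but complementary diagnostic.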
Studies have shown that when humans intervene in a data generating process that is expected to comply with the NBL, compliance stops. For example, Diekmann (2007) and Horton et al. (2018) show that fabricated scientific data do not conform with the NBL. The researchers find that retracted accounting papers deviate significantly from the NBL relative to a control group of papers. Cantu and Saiegh (2010) and Breunig and Goerres (2011) reveal the same effect for electoral data. In a similar spirit, Kaiser (2019) uncovers how discrepancies from the target NBL distribution can be used to test reliability among survey data sets. The NBL has been used extensively to detect fraud in accounting, finance, and macroeconomic data. Nigrini (2012) and Stambaugh et al. (2012) show that fraudulent trading records and fabricated returns do not comply with the NBL, whereas naturally occurring data do. Rauch, Goettsche, et al. (2013) apply the NBL to London Inter-bank Offered Rate (LIBOR) rates and successfully detect manipulated data. O'Keefe and Yom (2017) study the determinants of fraudulent behavior among failed banks between 1989 and 2015. They use the second-digit NBL to identify banks whose financial statements suggest tampering and purposeful misstatements. Their results suggest that insider abuse and fraud at banks are detectable through an NBL analysis of bank financial data. Hussain (2010) detects possible data errors, irregularities, and fraud by applying the NBL to the credit bureau data of commercial banks. By analyzing five European equity market indices, Kalaichelvan and Shawn (2012) find evidence that substantiates the criticism of the uniformity assumption for tests at the 1,000 level in favor of a distribution consistent with the NBL.
7 The law can be extended to digits beyond the first. In general, for the nth digit, n ≥ 2, the probability is given by P(d) = Σ_{k=10^(n−2)}^{10^(n−1)−1} log10(1 + 1/(10k + d)).
8 For more examples, see Formann (2010), Hill et al. (1995), Hill (1998), Leemis et al. (2000), and Morrow (2020).
Another useful property of data that obey the NBL is scale invariance, i.e., independence of the measurement units. This makes the law a powerful tool when testing data from different sources (i.e., countries or companies). Compliance with the NBL is also distinct from the imprecision (or variance) of the data. The data may well be very noisy but are still expected to conform with the law, as long as there is no deliberate falsification. For example, if a country's data are collected with error or irregularly but there is no manipulation, the first digits should still adhere to the NBL. In our application, this means that countries may differ in the way they count COVID-19 cases or deaths, but as long as the data for each country are expected to obey the NBL, we can test the data for goodness-of-fit to the NBL.
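Scale invariance is easy to verify numerically. The sketch below (illustrative only) draws a log-uniform sample, which is expected to obey the NBL, and shows that rescaling every observation by an arbitrary constant (e.g., a change of units) leaves the first-digit distribution essentially unchanged:

```python
import math
import random

def first_digit(x):
    """First significant digit of a positive number, via scientific notation."""
    return int(f"{x:.10e}"[0])

def digit_freq(values):
    """Empirical first-digit frequencies for digits 1..9."""
    counts = [0] * 10
    for v in values:
        counts[first_digit(v)] += 1
    return [c / len(values) for c in counts[1:]]

random.seed(42)
# Log-uniform sample spanning six orders of magnitude (expected to obey the NBL)
data = [10 ** random.uniform(0, 6) for _ in range(100_000)]

original = digit_freq(data)
rescaled = digit_freq([v * 3.7 for v in data])  # arbitrary change of units

# Both frequency vectors agree closely with each other and with P(d) = log10(1 + 1/d)
for d, (a, b) in enumerate(zip(original, rescaled), start=1):
    print(d, round(a, 3), round(b, 3), round(math.log10(1 + 1 / d), 3))
```

The rescaling factor 3.7 is arbitrary; any positive constant would do, which is the essence of the scale-invariance property.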

Goodness-of-Fit Measures
To measure how well the data comply with the NBL, we use several goodness-of-fit measures. The most intuitive and commonly used is the chi-squared statistic:

chi^2 = Σ_{d=1}^{9} (O_d − E_d)^2 / E_d,

where O_d and E_d are the observed and the NBL-expected frequencies for digit d, respectively. The chi-squared statistic, however, has several problems: it has low statistical power with small sample sizes and enormous power with large sample sizes. Therefore, we also use alternative measures of goodness-of-fit proposed in extant studies. We use a modified version of the Kuiper (1960) test proposed by Stephens (1970) and Giles (2007) that is less dependent on the sample size N:

V*_N = V_N (N^(1/2) + 0.155 + 0.24 N^(−1/2)), where V_N = max_x [F_N(x) − F_0(x)] + max_x [F_0(x) − F_N(x)],

and F_N(x) and F_0(x) are the observed cumulative distribution function (cdf) of leading digits and the cdf of data that comply with the NBL, respectively. In addition, we calculate the M-statistic proposed by Leemis et al. (2000):

M = max_{d} |o_d − e_d|,

and the D-statistic proposed by Cho and Gaines (2007):

D = [Σ_{d=1}^{9} (o_d − e_d)^2]^(1/2),

where o_d and e_d are the proportion of observations with d as the first digit and the proportion expected by the NBL, respectively. The latter two measures are also insensitive to sample size. We calculate each goodness-of-fit measure for two variables: the cumulative number of confirmed cases and the cumulative number of reported deaths. In unreported tests, we also analyze two other variables, the number of cured cases and the number of conducted COVID-19 tests, and find insignificant results.
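A compact implementation of these four statistics can be sketched as follows. This is an illustrative sketch using the standard published forms of the statistics; the exact normalizations in the paper's own code may differ:

```python
import math

# Expected NBL proportions e_d for first digits 1..9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def goodness_of_fit(first_digits):
    """Return (chi2, modified Kuiper, M, D) for a list of first digits in 1..9."""
    n = len(first_digits)
    obs = [first_digits.count(d) for d in range(1, 10)]   # O_d
    exp = [p * n for p in BENFORD]                        # E_d
    o = [c / n for c in obs]                              # observed proportions o_d
    e = BENFORD                                           # expected proportions e_d

    chi2 = sum((O - E) ** 2 / E for O, E in zip(obs, exp))

    # Kuiper V_N from the two one-sided sups of the cdf difference,
    # with the Stephens (1970) small-sample modification
    cum_o = [sum(o[:i + 1]) for i in range(9)]
    cum_e = [sum(e[:i + 1]) for i in range(9)]
    v = max(a - b for a, b in zip(cum_o, cum_e)) + max(b - a for a, b in zip(cum_o, cum_e))
    kuiper = v * (math.sqrt(n) + 0.155 + 0.24 / math.sqrt(n))

    m_stat = max(abs(a - b) for a, b in zip(o, e))               # Leemis et al. (2000)
    d_stat = math.sqrt(sum((a - b) ** 2 for a, b in zip(o, e)))  # Cho and Gaines (2007)
    return chi2, kuiper, m_stat, d_stat
```

Data that track the NBL proportions closely yield values near zero on all four measures, while uniformly distributed first digits yield a large chi-squared statistic, consistent with the sample-size sensitivity discussed above.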

Sample Description and Developmental Indicators
We first collect daily data from Johns Hopkins University for the cumulative number of confirmed cases, the cumulative number of cured cases, and the cumulative number of deaths 9 between January 22, 2020 and June 10, 2020. We also obtain the number of conducted tests from Our World in Data. 10 Studies have shown that naturally occurring processes comply well with the NBL when the data grow exponentially or close to it (Formann 2010; Leemis et al. 2000). Once the data reach a plateau, they are no longer expected to obey the NBL. Hence, for the data to comply with the NBL, we select the growth part using the following approach. Because the data show weekly seasonality, we first compute seven-day moving averages (MA) of the new daily number of confirmed cases. Then, for each country, we identify the date with the highest MA number of new daily confirmed cases. If there are several dates with the same maximum, we use the earliest as the cutoff. For our main analyses, we use data before the obtained cutoff for each country. 11 For developmental indicators, we select the following four proxies for democratic and economic development widely used in the literature: the Economist Intelligence Unit Democracy Index, GDP per capita, healthcare expenditures as a percentage of GDP, and the Universal Health Coverage Index. The Democracy Index (EIU) is a weighted average of answers to 60 questions from expert assessments, grouped into five categories: electoral pluralism, functioning of the government, political participation, political culture, and civil liberties. The index aims to measure the degree of democracy of a country. In addition to the Democracy Index, we use GDP per capita as a proxy for the country's ability to provide precise data. We also take the country's healthcare spending as a percentage of GDP and its Universal Health Coverage Index as proxies for the strength of each country's healthcare system.
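The cutoff selection described above can be sketched as follows. This is an illustrative sketch: `daily_new_cases` is assumed to be a date-ordered list of new daily confirmed cases for one country, and the epidemic curve shown is hypothetical:

```python
def growth_period_cutoff(daily_new_cases, window=7):
    """Index of the earliest day attaining the highest `window`-day moving
    average of new daily cases; data before this cutoff form the growth
    period used in the NBL tests."""
    ma = [
        sum(daily_new_cases[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(daily_new_cases))
    ]
    best = max(ma)
    # list.index returns the first (earliest) position attaining the maximum;
    # shift it back to an index into the original series
    return ma.index(best) + window - 1

# Hypothetical epidemic curve: rise, peak, decline
curve = [1, 2, 4, 8, 16, 30, 55, 90, 140, 180, 200, 195, 170, 140, 100,
         70, 45, 30, 20, 12]
cut = growth_period_cutoff(curve)
growth_data = curve[:cut]  # observations used in the goodness-of-fit tests
```

The seven-day window smooths out the weekly seasonality in reporting, so the cutoff lands at the peak of the smoothed curve rather than at a single-day spike.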
We download countries' democracy indices from the Economist Intelligence Unit for 2019. 12 We collect the gross domestic product per capita (GDP), healthcare expenditures as a percentage of GDP (HE GDP), and the Universal Health Coverage Index (UHC) for 2017 from the World Bank. 13 We also acquire 2019 population data for each country from Worldometer. 14 A total of 185 countries with available data were affected by COVID-19. The summary for each country can be found in Appendix A1. Variable definitions can be found in Appendix A2. We find that we cannot reject the NBL distribution for the entire world population for the cumulative number of confirmed cases at the 1% significance level (Appendix A1). 15 For the cumulative number of reported deaths, however, we reject the hypothesis of compliance with the NBL for the total world numbers. This indicates that, on average, countries are more likely to falsify death tolls and less likely to falsify the confirmed number of cases. Using country-level data, we also find that between 37 and 62 countries (depending on the goodness-of-fit measure used) out of 185 deviate from the NBL when reporting confirmed cases. Between 50 and 71 countries deviate from the NBL when reporting the number of deaths. 16

Table 1 provides descriptive statistics for the major variables in our analyses. For the confirmed number of cases, the goodness-of-fit measures show that the average country's data are borderline in terms of compliance with the NBL: they are consistent with the NBL at the 1% significance level but not at the 10% level. Observe further that the corresponding mean goodness-of-fit measures for the number of deaths are higher than for the number of confirmed cases and are largely not consistent with the NBL at 1%. This indicates that, ceteris paribus, countries are more prone to manipulate data on death rates. We must stress, however, that any inference about data manipulation based simply on individual goodness-of-fit statistics is questionable because such statistics largely depend on the selected sample size (country versus state, or state versus county level data).

9 https://coronavirus.jhu.edu/map.html. Downloadable database is available at https://github.com/CSSEGISandData/COVID-19.
10 https://ourworldindata.org/coronavirus-testing.
11 In unreported tests, we also use modified approaches. We find the maximum of the ratios MA(number of new daily cases)/(days since the first case for the country) and MA(number of new daily cases)/(days since the latest nonzero case for the country). The results are robust to these alternative definitions of the cutoff date.
12 https://www.eiu.com/topic/democracy-index.
13 https://data.worldbank.org/.
In our analyses, therefore, we aim to compare countries cross-sectionally.
The average country in our sample has over 42 million people, slightly less than $6,000 in terms of GDP per capita, 17 with roughly 6% of the GDP spent on healthcare expenditures, a democracy index of around 55 on the scale between zero and 100, and around 65% of the population are covered by universal health care (Table 1). The average sample size used to estimate the goodness-of-fit measures per country is slightly over 61 days for the number of confirmed cases and is around 40 for the number of deaths (until the end of the growth period). Table 2 provides mean values for our goodness-of-fit measures for the four quartiles of each of our independent variables: EIU , ln(GDP ), HE GDP , and U HC. The quartiles for the EIU Democracy Index roughly correlate with the definitions of the four regime types: full democracy, flawed democracy, hybrid regime, and authoritarian regime. The table shows a general monotonic trend for the data to deviate more from the Newcomb-Benford distribution as we move from the highest quartile to the lowest. For the top quartiles, we cannot reject the hypothesis that countries manipulate confirmed cases or death data at the 1% level. For the bottom quartile, however, we reject that hypothesis about half the time for the number of confirmed cases, and almost every time for the number of deaths. In a univariate setting, this is consistent with our three hypotheses. We also find that for the cumulative number of deaths, the difference between goodness-of-fit measures for the top and bottom quartiles is always significant. Table 3 provides Pearson correlation coefficients between major variables. The four major economic indicators, especially EIU , ln(GDP ), and U HC, are highly correlated, with correlation coefficients ranging between 0.59 and 0.85 (all values are statistically significant). HE GDP is also https://databank.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG/1ff4a498/Popular-Indicators.
At the time we collected the data, many countries still did not have the World Bank data available for 2018 or 2019. 2017 is the latest year for which the data are available for all countries.
14 https://www.worldometers.info/. 15 The 1% threshold for all four measures are: 20.09 for Chi-squared, 2.00 for Kuiper, 1.21 for M, and 1.57 for D. 16 The values go up to between 62 and 103 and between 72 and 99 countries for the number of cases and deaths, respectively, when the 10% level of significance is used. Note also that switching from country-level data to state data or county-level data increases the statistics significantly (see Table A1.2 for the case of the U.S. county-level data. 17 Note that we use the natural logarithm of the population and GDP values when calculating averages for Table 1. The table presents the mean value of goodness-of-fit measures (Chi-square, Kuiper, M and D) for the cumulative number of confirmed and death cases, developmental indicators (EIU, ln(GDP), HE GDP, UHC), and other variables. The number of observations vary due to missing values. The original dataset is included in Appendix A1. ***, ** and * denote goodness-of-fit measures that correspond to significant differences from the theoretical NBL distribution at 1%, 5% and 10% level, respectively. We also analyze the difference between the Confirmed an the Death mean values for each goodness-of-fit measures using the t-test. 3, 2, and 1 indicate significant difference between the Confirmed and the Death cases at the 1%, 5% and 10% level, respectively. All variable definitions are in Appendix A2.
correlated with the other indicators, with correlation coefficients ranging between 0.37 and 0.46 (also significant). These variables are most likely proxies for the same indicator, the development level of a country, and therefore-to avoid multicollinearity-we include only one indicator at a time in our analysis. 18 The four goodness-of-fit measures are also highly correlated with each other. The total number of confirmed cases and the country's population are also significantly correlated. Univariate results in Table 3 also show that all goodness-of-fit measures are negatively correlated with the four economic indicators, with 22 out of 32 correlation coefficients being significant (all correlation coefficients for the cumulative number of deaths are significant).
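As an illustration of the first-digit test (a minimal sketch, not the authors' implementation; the example series is synthetic), the theoretical Benford frequencies and a chi-squared goodness-of-fit statistic can be computed as follows:

```python
import math
from collections import Counter

# Theoretical Newcomb-Benford first-digit probabilities:
# P(d) = log10(1 + 1/d) for d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Return the leading (most significant) digit of a number >= 1."""
    return int(str(int(abs(x)))[0])

def benford_chi_squared(values):
    """Pearson chi-squared statistic of the observed first-digit
    distribution against the theoretical Benford distribution."""
    digits = [first_digit(v) for v in values if v >= 1]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in BENFORD.items()
    )
```

With eight degrees of freedom, the statistic would be compared against the 1% critical value of 20.09 quoted in the text; an exponentially growing series such as the powers of 2 yields a statistic far below that threshold.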

Goodness-of-fit and Economic Indicators
We start with a simple ordinary least squares (OLS) regression model in which our goodness-of-fit measures appear on the left-hand side and the economic indicators on the right-hand side:

Goodness-of-fit_i = β0 + β1 · Indicator_i + β2 · ln(Population)_i + β3 · Number of Days_i + ε_i,    (6)

where Indicator_i denotes one of the four economic indicators: EIU, ln(GDP), HE GDP, or UHC. Higher values of the goodness-of-fit measures indicate greater deviation from the NBL. If more developed countries are less likely to manipulate data, we expect the coefficient β1 to be negative. How well each country's data are expected to obey the NBL depends on their span: for example, countries with larger populations and more confirmed cases or deaths are expected to follow the NBL more closely. To control for this, we include the natural logarithm of the country's total population. Even though the Kuiper, M, and D statistics are less dependent on the sample size, the goodness-of-fit measures may still be affected by the sizes of the samples used to estimate them. To control for the sample-size effect, we include Number of Days_i, the number of days with nonzero confirmed cases (or, correspondingly, with nonzero deaths) between January 22, 2020 and the cutoff date of the growth part for each country.

Notes to Table 2: The table presents mean values of the goodness-of-fit measures (chi-squared, Kuiper, M, and D) for the cumulative numbers of confirmed and death cases by quartile of the developmental indicators (EIU, ln(GDP), HE GDP, UHC). Smallest, Q2, Q3, and Largest denote the groups partitioned at the 25%, 50%, and 75% quartiles. ***, **, and * denote goodness-of-fit measures that differ significantly from the theoretical NBL distribution at the 1%, 5%, and 10% levels, respectively. We also test the difference between the Smallest and Largest quartiles for each indicator using a t-test; 3, 2, and 1 indicate significant differences between the Smallest and Largest quartiles at the 1%, 5%, and 10% levels, respectively. All variable definitions are in Appendix A2.
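A minimal sketch of the OLS specification in Equation 6, using simulated rather than actual country data (all values and coefficients below are made up for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical country-level inputs (for illustration only): a
# goodness-of-fit measure, one economic indicator, log population,
# and the number of days used to estimate the measure.
n = 185
indicator = rng.uniform(0, 100, n)          # e.g., the EIU index
log_pop = rng.uniform(13, 21, n)            # ln(population)
days = rng.integers(30, 140, n).astype(float)

# Simulated outcome with a negative loading on the indicator,
# mimicking the paper's hypothesized sign.
gof = 10 - 0.05 * indicator - 0.2 * log_pop + 0.01 * days + rng.normal(0, 1, n)

# Equation 6: gof_i = b0 + b1*Indicator_i + b2*ln(Pop)_i + b3*Days_i + e_i
X = np.column_stack([np.ones(n), indicator, log_pop, days])
beta, *_ = np.linalg.lstsq(X, gof, rcond=None)
b0, b1, b2, b3 = beta
```

Under the paper's hypothesis, the estimated b1 is negative: higher development indicators go together with smaller deviations from the NBL.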
The results of estimating Equation 6 are presented in Table 4. Panel A provides estimates for the cumulative number of confirmed cases. All but one of the coefficients on the economic indicators are negative. The coefficients on ln(GDP) and UHC are always significant. The coefficient on EIU is significant only when the chi-squared goodness-of-fit measure is used, and the coefficient on HE GDP lacks significance in all tests. Panel B provides estimates for the cumulative number of deaths. All coefficients are negative and significant, and their magnitudes are much larger than those for the number of confirmed cases. We also find that the coefficients on the corresponding economic indicators are statistically different between Panels A and B in every case in Table 4.
To disentangle the political and other components of the Economist Intelligence Unit Democracy Index, we then analyze the five components of the index separately: electoral pluralism (ELECT), functioning of the government (GVMT), political participation (PART), political culture (CULT), and civil liberties (LIBERT). The results are presented in Table 5. We find that the results for the overall EIU index are driven by three of its components: electoral pluralism, functioning of the government, and civil liberties (the significance of these coefficients coincides in each case with the significance of the overall index), but not by political participation or political culture (these coefficients are never significant). We also substitute the EIU measure with the "thin" definition of democracy, the Freedom House Electoral Democracy Index. We use the dummy variable for electoral democracy reported by Freedom House (FH DEM), as well as the sum of its measures of political freedom and civil liberties (FH AV). Again, for the cumulative number of cases, the coefficients are negative but lack significance; for the cumulative number of deaths, they are negative and significant. The results show that the findings are not driven by the choice of the democracy measure or by a spurious transparency component in the index.
We interpret the data as consistent with the argument that more democratic and more highly developed countries are less likely to deviate from the NBL when reporting pandemic data. Specifically, countries with higher GDP per capita and universal health coverage are less likely to manipulate their data during COVID-19. For the democracy index (EIU) and health expenditures as a percentage of GDP (HE GDP), we find convincing evidence only for the number of deaths. We conclude that the relationship is more pronounced for the total number of deaths than for the number of confirmed cases. As predicted, the coefficient on the control variable ln(Population)_i is negative and significant in all regressions: countries with higher populations (and total numbers of cases) deviate less from the NBL. The results are also economically significant: a one-standard-deviation increase in an economic indicator, on average, corresponds to a 0.25-standard-deviation decrease in the goodness-of-fit measures. This value is roughly the same for the number of confirmed cases and for the number of deaths, across all economic indicators. Instead of examining the linear relationship with the goodness-of-fit measures, one could examine the probability that a country's data deviate from the NBL. To do so, we first identify the critical values (at the 1% significance level) for each goodness-of-fit measure and create a set of four dummy variables, Chi-sq., Kuiper, M, and D, where each dummy variable equals one if the corresponding goodness-of-fit measure is above the critical value (i.e., we reject the null hypothesis that the data obey the NBL), and zero otherwise. We then estimate Equation 6 using a logit model. Again, most coefficients on the economic indicators are negative, and all coefficients on ln(GDP) and UHC are significant. For the death toll, all coefficients on all economic indicators are negative and significant. For brevity, we omit the table with these results.
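The deviation dummies feeding the logit specification can be sketched as follows; the thresholds are the 1% critical values quoted earlier in the text, and the example country's measures are hypothetical:

```python
# 1% critical values for the four goodness-of-fit measures, as
# quoted in the text.
CRITICAL_1PCT = {"Chi-sq.": 20.09, "Kuiper": 2.00, "M": 1.21, "D": 1.57}

def deviation_dummies(measures):
    """Map each goodness-of-fit value to 1 if it exceeds the 1% critical
    value (i.e., compliance with the NBL is rejected), else 0."""
    return {
        name: int(value > CRITICAL_1PCT[name])
        for name, value in measures.items()
    }

# A hypothetical country whose chi-squared statistic rejects the NBL
# while the other three measures do not:
example = deviation_dummies({"Chi-sq.": 25.3, "Kuiper": 1.4, "M": 0.9, "D": 1.1})
```

These binary outcomes would then serve as the dependent variables in the logit version of Equation 6.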
We again conclude that more developed countries and countries with better health systems follow the NBL more closely, and the relationship is more pronounced for the number of reported deaths than for the number of confirmed cases. We also assert that the findings are not driven by the choice of the model.
Finally, we conduct the same tests for the cumulative number of cured cases and the number of COVID-19 tests conducted. On average, we cannot reject the null hypothesis that these data comply with the NBL; that is, we find no evidence that countries manipulated data on cured cases or the number of conducted tests. The regression results are also not significant, indicating no systematic cross-sectional differences between countries. We conclude that countries are most prone to falsify mortality data, slightly less so the number of confirmed cases, and that there is no evidence of systematic falsification of cured cases or the number of tests. The cross-sectional difference between countries is likewise strongest for the death toll, weaker for the total number of confirmed cases, and insignificant for the number of cured cases and tests.

Robustness Analyses
One limitation of the analysis above is that it depends on the cutoff date for the growth part of the data. The cutoff date is estimated from the very data whose validity we are assessing, which creates a possible endogeneity problem. To address this issue, we use several approaches. First, instead of using a country-specific cutoff date, we use the same "global" cutoff date for all countries: 80 days after January 22, 2020 (i.e., April 11, 2020). We pick 80 days because it corresponds to the second tercile of cutoff dates in our sample. An earlier cutoff would leave sample sizes that are too small for many countries, especially those affected by the pandemic later than others; for longer periods, too many countries would have already reached their plateaus and would thus no longer be expected to obey the NBL. We then calculate our four goodness-of-fit measures using the global cutoff date and re-estimate Equation 6. The results are reported in Table 6. Panel A presents the data for the confirmed cases: all coefficients on all four economic indicators are negative and significant in all regressions. Panel B presents the data for the death count: again, all coefficients in all regressions are negative and significant, confirming our earlier findings.
Second, we use a cutoff of 45 days since the first case for each country, again selected as the second tercile, and re-estimate Equation 6 for the new goodness-of-fit measures. The results, presented in Table 7, are largely consistent with our previous findings. Finally, we use a "window" approach: instead of a single cutoff date, we estimate a series of goodness-of-fit measures over a range of dates (specifically, ±1, ±3, and ±5 days around the original cutoff date), average those values, and re-estimate Equation 6. Again, the results are unchanged (untabulated). We conclude that our results are not driven by the specific selection of cutoff dates for the growth part of the data. Another concern is that our results might be driven by countries with few cases or few data points. To test for that, we exclude countries with fewer than 200 (500 and 1,000) total confirmed cases. We then exclude countries with fewer than 30 (40) days of nonzero cases. We also exclude countries with the highest 1% (5%) of goodness-of-fit measures. In unreported tests, we find that the results are robust in all cases. We conclude that our results are not driven by small countries, countries with a small number of cases, or extreme deviations from the NBL.
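The cutoff construction that these robustness checks vary can be sketched as follows (a minimal version assuming a trailing seven-day window; the paper does not specify the exact windowing convention):

```python
def seven_day_ma(daily_new):
    """Trailing seven-day moving average of daily new cases; for the
    first six days, averages over the days available so far."""
    return [
        sum(daily_new[max(0, i - 6): i + 1]) / min(i + 1, 7)
        for i in range(len(daily_new))
    ]

def growth_cutoff(daily_new):
    """Index of the earliest day with the maximal seven-day MA of new
    cases; data before this day form the 'growth' part of the series."""
    ma = seven_day_ma(daily_new)
    return ma.index(max(ma))  # list.index returns the earliest maximum
```

The "global" and "45-day" robustness variants replace this data-driven cutoff with a fixed date, while the "window" variant averages goodness-of-fit measures computed over a band of dates around it.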

Regional Data
Testing for compliance with the NBL requires sufficient data. For many countries in our analysis, the goodness-of-fit measures are calculated from relatively small samples of between 40 and 140 days. Even though we control for the sample size and conduct robustness checks, making inferences from such small samples might be problematic. The sample size increases significantly for a country that reports data at the regional (state, territory, or province) level: each value reported at the regional level can then be used to estimate the goodness-of-fit measures, instead of using country-level data. The upside of this method is that the goodness-of-fit measures are estimated with greater precision; the downside is the small number of countries that collect regional data.
Fifty of the 185 countries in our sample collect regional-level data. For these countries, we re-estimate the goodness-of-fit measures and re-estimate Equation 6. The results are reported in Table 8. Panel A depicts the confirmed number of cases; Panel B, the number of deaths. For the cumulative number of cases, 12 out of 16 coefficients are negative, with eight significant. For the cumulative number of deaths, all coefficients are negative and significant, even with the much smaller number of countries in these tests. The results are consistent with our earlier finding: countries with higher democracy indices, GDP per capita, health expenditures, and universal healthcare coverage are less likely to manipulate pandemic data, especially the number of deaths. We further conclude that our findings are not driven by estimation error in the goodness-of-fit measures.

The Case of the United States
The United States of America collects data not only at the state level but also at the county level, which enables a deeper analysis of individual states. The United States is classified as a "flawed democracy" by the Economist Intelligence Unit Democracy Index, with around $60,000 GDP per capita, an unprecedented 17% of GDP spent on healthcare, and a high Universal Health Coverage Index of 84. Appendix A1 shows no systematic indication of COVID-19 data manipulation, either for the cumulative number of confirmed cases or for the cumulative number of reported deaths: all goodness-of-fit measures are below the critical values at the 1% significance level. Yet much controversy exists as to whether individual states manipulate COVID-19 data (King 2020; Smith et al. 2020). Mass media outlets argue that state governments downplay the spread of the virus for political gains. It is not clear, however, whether a state's misreporting of data differs cross-sectionally with the political parties controlling its government (i.e., the legislative branch, or senate, and the executive branch, or governor). To see if political systems drive the differences between states, we conduct a modified analysis in which we treat each state as a separate "country." We measure the GDP per capita and HE GDP variables at the state level. We then substitute the EIU regime indicator with three alternative political indicators: Won, a dummy variable indicating that the incumbent U.S. president (a Republican) won the state in the previous election; Senate, a dummy variable indicating that the state legislature's majority is from the same political party as the current U.S. president; and Governor, a dummy variable indicating that the state's governor is from the same party as the incumbent U.S. president.
We re-estimate the end of the pandemic growth period and, using county-level data, recalculate the eight goodness-of-fit measures (four for the cumulative number of confirmed cases and four for the cumulative number of deaths). We then estimate Equation 6. The results are shown in Table 9. The coefficient on ln(GDP) is mostly insignificant. The negative sign of HE GDP in all tests is consistent with our previous findings, although the coefficients lack significance in most regressions. Won and Senate are positive and significant in all tests, indicating that states that voted for the incumbent U.S. president and have governments led by the same political party are more likely to manipulate pandemic data. We also show that the results for these two variables are stronger for the death toll than for the number of confirmed cases, consistent with our earlier findings. We note that the sample for the U.S. tests is much smaller than the original sample, consisting of only 50 states. Finally, the Governor variable is insignificant throughout.

Second Digit Tests
The NBL can be extended to digits beyond the first (O'Keefe and Yom 2017; Hussain 2010); beyond the second digit, the theoretical distribution quickly converges to uniform. Diekmann (2007) notes that, when fabricating data, test subjects also naturally lean toward smaller first digits, resulting in Benford-like first-digit distributions in fabricated data, and suggests that in some cases the second-digit test may provide a clearer assessment of data manipulation. We therefore repeat our tests using second-digit goodness-of-fit measures instead of leading-digit ones. Our sample size drops somewhat, especially for the number of deaths, because the test requires values of at least ten. The results are presented in Table 10, again in two panels: one for the confirmed number of cases and one for the number of deaths. In Panel A, all coefficients are negative, and nine out of 16 are significant. In Panel B, all coefficients are negative, and, except for two coefficients for UHC, none are significant. We conclude that the second-digit test results accord with our main findings.
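The theoretical second-digit frequencies follow from summing the two-digit Benford law over all possible first digits; a minimal sketch:

```python
import math

def second_digit_prob(d):
    """Benford probability that the second significant digit equals d,
    obtained by summing over all possible first digits d1 = 1..9:
    P(d2 = d) = sum_{d1} log10(1 + 1/(10*d1 + d))."""
    return sum(math.log10(1 + 1 / (10 * d1 + d)) for d1 in range(1, 10))

# Unlike the first digit, the second digit can be zero, and its
# distribution is much flatter (close to uniform).
probs = {d: second_digit_prob(d) for d in range(10)}
```

The resulting probabilities decline only gently from about 0.120 for "0" to about 0.085 for "9", which is why second-digit tests need larger samples to attain power.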
In unreported tests, we also combine our robustness checks, i.e., we conduct tests using regional data and second-digit tests for global cutoff dates and during the 45 days since the first case for each country. Our findings are not affected by choice of the test method or time period.

Swine Flu Pandemic of 2009-2010
A natural extension of our study is to ask whether the negative relationship between goodness-of-fit measures and economic indicators holds for other pandemics. The unit of observation in our study is a country, but pandemics that engulf many countries and for which data are available are rare in modern history. One natural candidate is the recent swine flu (H1N1) pandemic of 2009-2010, which lasted until August 2010. The pandemic affected 58 countries, with tens of thousands (by some estimates, millions or even hundreds of millions) of people infected and tens of thousands (by some estimates, hundreds of thousands) of deaths. Even though the pandemic happened relatively recently and after the advent of the Internet, surveillance data availability and reporting were much more limited in 2009 than during COVID-19. Many countries did not collect daily or even weekly data, reporting was limited, and there was little public availability of data. As a result, only a very small number of studies directly test the accuracy or manipulation of data during the swine flu (H1N1) 2009-2010 pandemic (a notable exception is Idrovo, Fernández-Niño, et al. 2011). The WHO, the Pan American Health Organization (PAHO), and the Centers for Disease Control and Prevention (CDC) provide many estimates of the total number of cases and deaths, but these cannot be used with the Newcomb-Benford test because the test gauges human intervention in actual reported data.

(Notes to the regression tables: All models are estimated using OLS regression. P-values for one-tailed t-tests are in parentheses. ***, **, and * denote significance at the 1%, 5%, and 10% levels, respectively. All variable definitions are in Appendix A2.)
To apply the NBL test, we collect data for 35 countries in the Americas that provided weekly reports of the numbers of confirmed cases and deaths to the PAHO. We obtain the data for the weekly number of confirmed cases and the distribution of first digits from Idrovo, Fernández-Niño, et al. (2011); the data for the weekly number of deaths are downloaded from the PAHO website. We then repeat the analyses and re-estimate Equation 6 for the swine flu (H1N1) 2009-2010 data, using 2009 values for the economic indicators. The results are reported in Table 11. Panel A depicts the results for the number of confirmed cases. Twelve of the 16 coefficients on the macroeconomic indicators are negative. It should be noted that the sample for this test is extremely small, with at most 35 countries, and obtaining significant results with such small samples is challenging. Even so, five coefficients are significant, and two more fall just short of significance. Panel B illustrates the results for the number of deaths; the sample size here is even smaller. Fourteen of the 16 coefficients are negative, and seven are significant. We conclude that the swine flu (H1N1) 2009-2010 results are largely consistent with our findings for the COVID-19 pandemic.

Discussion and Conclusion
In this paper, we investigate the relationship between the accuracy of reported data and macroeconomic indicators for a set of 185 countries affected by the COVID-19 pandemic. We use deviation from the Newcomb-Benford law of anomalous numbers as a proxy for data manipulation. For approximately one-third of the countries, we document some evidence of data manipulation, especially for the death toll. We find a negative relationship between the four NBL goodness-of-fit measures and the four economic indicators, and the relationship is stronger for the number of deaths than for the number of confirmed cases. Overall, we conclude that democracies, more economically developed countries, and countries with stronger healthcare systems provide more accurate data during pandemics. Authoritarian regimes and poorer countries, on the other hand, are more likely to manipulate data, specifically the death toll. We do not believe that our results are driven by noise in the data or by the specific method used, because they are robust to alternative testing periods and are not driven by small countries, countries with a small number of cases, or extreme deviations from the NBL. We also show that the relationship holds for the 50 countries that report regional data, for second-digit tests, and for the previous swine flu (H1N1) pandemic.
The interpretation of our findings assumes that deviations from the Newcomb-Benford law are indicative of data manipulation. Indeed, many studies in macroeconomics, accounting, finance, and forensic analysis demonstrate that human intervention and data manipulation create data sets that violate the NBL, while many naturally occurring processes generate data that obey the law. This makes the NBL a useful tool for detecting data manipulation. Several limitations of our study should nonetheless be mentioned. Although we use compliance with the NBL as a proxy for non-manipulation of data, alternative interpretations of our findings are possible. The test works only if the data are expected to obey the NBL. Several indicators suggest that pandemic data are a good candidate for NBL testing, and we use careful techniques to separate out the "growth" part of the data. However, if the pandemic data are not supposed to follow the Newcomb-Benford law, then the observed relationship could be explained by expected deviations from the law driven by other factors, such as sample size and the span of the data; we control for these effects in our tests. The aim of this paper is not to provide evidence on whether a particular country manipulates data. Such claims require precise estimation of the goodness-of-fit measures and clear evidence that the country's expected distribution is indeed the NBL. The number of days between the beginning of the pandemic and a country reaching its plateau (after which the NBL is no longer applicable) is small; therefore, the goodness-of-fit measures are estimated with error, and conclusions for individual countries are difficult to state with certainty. Instead, this study documents a general relationship between countries' macroeconomic indicators and their tendency to report inaccurate data. The paper also raises the question of whether falsifying data during pandemics is a short-lived strategy for governments.
Does it have an immediate payback, or is it sustainable over the long run? We should also note that, even though the Newcomb-Benford test is less sensitive to noise in the data, there remains some chance that divergence from the expected distribution is due not to the deliberate supply of falsified data but to low data quality or structural breaks in the data.
Our paper highlights the importance of independent projects to verify data supplied by the governments. Further research is needed that would combine different methods that test for data manipulation, including the Newcomb-Benford law, biostatistics, moments of distributions, excess mortality rates, and social media data. Even more important is research related to methods that can prevent data manipulation and fraud during pandemics.