Newcomb-Benford Law Analysis on COVID-19 Daily Infection Cases and Deaths in Indonesia and Malaysia

Aims: Every country has been racing to contain the spread of COVID-19. The published data on daily infections and deaths can be used to measure the effectiveness of the control interventions. We focus our study on two Southeast Asian countries, Indonesia and Malaysia, during the period between March and November 2020. Methods: The Newcomb-Benford law has been used to analyze the probabilities of the first significant digits in naturally occurring data since the late 19th century. It is a prominent statistical tool because of its capability to detect fraud in data sets. A chi-squared test was used to quantify the closeness of the observed data to the Newcomb-Benford distribution. Results: The distributions of daily infection and death cases in Indonesia followed the Newcomb-Benford law, while the opposite result was obtained for Malaysia. Conclusion: We verified the daily COVID-19 infection and death cases in Indonesia and Malaysia using the Newcomb-Benford law. It can be inferred that, between March and November 2020, the control interventions in Indonesia were less effective than those in Malaysia.


Introduction
Coronavirus disease 2019 (COVID-19) originated in Wuhan, China, and has caused an outbreak since December 2019. Globally, the virus has infected over 2020 [1]. The virus is a real threat to all mankind because it is so contagious that it spreads easily through the droplets produced by coughs or sneezes [2]. Every country in the world has been racing to contain the spread of the virus by implementing control interventions to "flatten the curve" of COVID-19 infections.
In this paper, we focus on two neighboring Southeast Asian countries, Indonesia and Malaysia.
To mitigate the COVID-19 outbreak, the Government of Indonesia took several immediate actions: it conducted rapid reverse transcription polymerase chain reaction (RT-PCR) testing, set up a special task force for COVID-19, implemented large-scale social restrictions (LSSR), and released a stimulus package in response to the social restrictions [3][4][5]. The Government of Malaysia, for its part, imposed a movement control order (MCO), carried out targeted screening, and released three stimulus packages [4,6]. To monitor the effectiveness of the control interventions, both Governments published the number of confirmed COVID-19 cases and deaths on a daily basis. We note, however, that there may be an incentive to modify the reported daily case and mortality numbers in order to protect a country's international image.
Governments also faced at least three problems in measuring the effectiveness of the control interventions: 1) some infected people were asymptomatic, so the number of daily cases may not be accurate [7]; 2) widespread fake news left people confused and distrustful of the Government [8]; and 3) different standards were used to calculate the death rate (Case Fatality Rate vs. Infection Fatality Rate) [9]. For these reasons, we intend to validate the number of daily cases, as well as the mortality numbers, for Indonesia and Malaysia using the Newcomb-Benford Law (NBL). In particular, we applied the NBL first-digit law to the daily infection reports and mortality numbers of Indonesia and Malaysia. NBL is known to be more stable when the range of the numbers is larger. In a nutshell, NBL states that the first digits of any natural data series are not uniformly distributed but follow a very specific distribution. Hence, it has been generally accepted as a tool for detecting falsified information. The results can additionally be extended to evaluate the effectiveness of "flattening the infection curve" through appropriate choices of control interventions [10].

Motivation
It has been reported in the news that Indonesia has not been successful in handling the spread of COVID-19. According to an Indonesian epidemiologist, the unsuccessful containment of the virus was due to a lack of discipline in following health protocols among government officials and low awareness of prevention among the public. Yet the Government took no decisive action [11] to improve testing capacity, contact tracing, and quarantine measures. This is supported by the report in [12] that Indonesia was still struggling to mitigate the spread of COVID-19.
However, the Indonesian Government claimed that its actions were quite good [13].
In contrast, Malaysia has shown strong capacity and a stable foundation in outbreak preparedness and response. In October 2020, the World Health Organization (WHO) commentary on the Malaysian Government's capability and effort to contain the virus transmission was laudatory [14]. Malaysia took immediate action to curtail the spread of the virus by implementing non-pharmaceutical interventions. Furthermore, Malaysian citizens remained vigilant and maintained good practice of the recommended precautionary measures. However, no official quantitative measures supporting these claims have been published. Contradictory claims (or fake news) may cause misinformation to spread on social media. Therefore, a study is needed to assess the effectiveness of the control interventions implemented in Indonesia and Malaysia. In this paper, we aim to present quantitative measures of the reliability of the respective claims for Indonesia and Malaysia between March and November 2020.

Previous Works
The reporting of cases has become a critical issue and is under intense scrutiny.
In [15], the authors reported that the Brazilian "data blackout" gained international visibility when Johns Hopkins University threatened to drop Brazil from its worldwide data set. The authors also claimed that the Brazilian data (new cases and mortality) did not conform to the NBL distribution of the first digit. The reported χ² values for daily new cases and mortality were 23.363 and 14.115, respectively. On May 1, Brazil reported that COVID-19 was the cause of death for 5,901 people. On June 5, Brazil's Health Ministry took down the website that reported cumulative coronavirus numbers, only to be ordered by the country's Supreme Court to reinstate the information. Such a case could lead to under-reporting of COVID-19 cases. The researchers in [16] applied Kuiper's test to the daily new cases of COVID-19. The authors reported that the results for China, the USA, and Italy passed the NBL test, and the data were thus considered valid in terms of accuracy. We note that Kuiper's test is a modified form of the Kolmogorov-Smirnov test, which is commonly used with continuous data to identify the difference between two cumulative distributions.
While debate continues on the use of NBL to validate government data, some researchers have indicated that NBL is valid only while the pandemic is growing and widespread. In [10], the authors simulated countermeasures for controlling the pandemic, notably by adjusting the infection rate. The results showed low conformity to the NBL distribution, as predicted. Even the early application of NBL to COVID-19 daily cases in [17] gave criteria for its validity: for example, the data to be tested has to grow 10% or more, the data series must span 50 or more days, and the magnitude change must be 3 or more.
A limitation of the χ² test is its high sensitivity to sample size: it tends to reject the null hypothesis even for small departures from the expected distribution.
The Kolmogorov-Smirnov test, by contrast, is better suited to non-categorical continuous data, which is essentially the type of data in most COVID-19 daily reports. Metrics such as the root mean squared error (RMSE) and the mean absolute deviation (MAD) are commonly applied for further analysis in other NBL-related areas.
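The two deviation metrics mentioned above can be sketched as follows. This is a minimal illustration, assuming both metrics are computed on digit proportions (observed versus NBL-expected) averaged over the nine first digits. The source does not state which convention was used, and the function names and example counts are hypothetical.

```python
import math

# Expected first-digit proportions under the Newcomb-Benford law (digits 1..9).
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def digit_proportions(counts):
    """Convert nine first-digit counts into proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    return [c / total for c in counts]


def mad(counts):
    """Mean absolute deviation between observed and expected proportions."""
    props = digit_proportions(counts)
    return sum(abs(o - e) for o, e in zip(props, BENFORD)) / len(BENFORD)


def rmse(counts):
    """Root mean squared error between observed and expected proportions."""
    props = digit_proportions(counts)
    squared = sum((o - e) ** 2 for o, e in zip(props, BENFORD))
    return math.sqrt(squared / len(BENFORD))


# Hypothetical counts for digits 1..9 (assumption: not taken from the paper).
example_counts = [30, 18, 12, 10, 8, 7, 6, 5, 4]
print(f"MAD = {mad(example_counts):.4f}, RMSE = {rmse(example_counts):.4f}")
```

Unlike the χ² statistic, neither metric scales with the sample size, which is one reason they are often reported alongside it.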

Data and Methods
NBL states that the first significant digits (FSD) in any naturally occurring data set are not equally probable but follow a specific distribution given by [18]

$$P(D_1 = d_1) = \log_{10}\left(1 + \frac{1}{d_1}\right), \quad d_1 \in \{1, 2, \ldots, 9\}, \tag{1}$$

where $D_1$ is a discrete random variable defined over the set of FSD. Using (1), the probability that the first significant digit is 1 is 0.301, that it is 2 is 0.176, that it is 3 is 0.125, and so on. NBL is commonly used for detecting manipulation and fraud in tax data, stock price data, etc. [19].
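As a quick check of the first-digit probabilities quoted above, the following minimal sketch evaluates the standard NBL formula log10(1 + 1/d) for d = 1, ..., 9 and extracts the first significant digit of a count. The function names are illustrative and do not come from the source.

```python
import math


def benford_first_digit_probs():
    """Return {d: P(D1 = d)} for d = 1..9 under the Newcomb-Benford law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_significant_digit(n):
    """Return the leading digit of a positive integer count."""
    if n <= 0:
        raise ValueError("first significant digit is undefined for n <= 0")
    return int(str(n)[0])


probs = benford_first_digit_probs()
for digit, p in probs.items():
    print(digit, round(p, 3))
```

The nine probabilities sum to one because the product of the terms (1 + 1/d) telescopes to 10.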
The data underlying this paper were obtained from the Worldometer website. The graphs of cumulative COVID-19 cases in Indonesia and Malaysia are depicted in Fig. 1. The analysis was performed using the chi-squared (χ²) goodness-of-fit test at the 5% significance level. We also computed the p-value and the root mean square error (RMSE) against the expected distribution. We used the following hypotheses for our test. Hypothesis 0 (H0): both distributions (the real-case and NBL distributions) are the same. Hypothesis 1 (H1): the two distributions differ.
As the number of categories (the FSD of interest) is 9, the chi-squared test has 8 degrees of freedom. As a result, the threshold χ²(8) = 15.507 was obtained and used to decide whether to accept or reject H0. Further, a p-value less than or equal to the significance level indicates sufficient evidence to conclude that the observed distribution is significantly different from the expected distribution.
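The decision rule described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the p-value uses the closed-form chi-squared survival function that holds for an even number of degrees of freedom (here 8), and the function names and example counts are hypothetical.

```python
import math

ALPHA = 0.05                # significance level used in the text
CRITICAL_CHI2_DF8 = 15.507  # threshold quoted in the text for 8 degrees of freedom


def chi2_statistic(counts):
    """Chi-squared statistic of nine first-digit counts against NBL."""
    n = sum(counts)
    expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))


def chi2_sf_df8(x):
    """P(X > x) for a chi-squared variable with 8 degrees of freedom.

    For even degrees of freedom 2m the survival function is
    exp(-x/2) * sum_{k<m} (x/2)^k / k!; here m = 4.
    """
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(4))


def benford_chi2_test(counts):
    """Return (chi2, p_value, reject_h0) for the first-digit test."""
    chi2 = chi2_statistic(counts)
    p_value = chi2_sf_df8(chi2)
    return chi2, p_value, chi2 > CRITICAL_CHI2_DF8


# Hypothetical counts for digits 1..9 (assumption: not the paper's data).
chi2, p_value, reject = benford_chi2_test([30] * 9)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}, reject H0: {reject}")
```

Comparing the statistic with 15.507 and comparing the p-value with 0.05 are equivalent decisions, since 15.507 is the point at which the survival function equals 0.05.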

Results
The distributions of the number of daily infection and mortality cases for Indonesia and Malaysia are illustrated in Fig. 2. The χ² value was 55.97, which is anomalously lower than the one shown in Table 1 and more compliant with NBL. This may confirm the statement that Malaysia has been seeing its third wave of the COVID-19 epidemic, which started in late September 2020.
Since Indonesia's infection and death case data fully conform to the NBL of the first digit, we further extended the NBL test to the second digit. The second-digit distribution is given by

$$P(D_2 = d_2) = \sum_{k=1}^{9} \log_{10}\left(1 + \frac{1}{10k + d_2}\right), \quad d_2 \in \{0, 1, \ldots, 9\}, \tag{2}$$

where $D_2$ is a discrete random variable defined over the set of second digits. Table 2 shows the statistics of the second-digit distributions, obtained using the same methods that we applied to the first-digit statistics. The results indicate very high conformity to the NBL distribution. This would imply that the number of COVID-19 cases grew steadily and that the data are legitimate according to the NBL of the second digit. Note that we ignored entries that have no second digit (< 10).
We provide a brief explanation of this phenomenon. If the cases were steadily rising, the cumulative cases would result in a consistent change in each of the first- and second-digit distributions. However, if the numbers initially grew and the growth then slowly declined, the first and second digits would stagnate at a particular digit. This would lead to non-conformity to the NBL distribution. If the COVID-19 cases steadily increased yet were found not to conform to NBL, this would be an indication of data manipulation or an unnatural setting. This is certainly not the case with Indonesia's data, as both the first- and second-digit distributions were found to conform to the NBL distribution.
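The second-digit probabilities can be computed with the standard NBL expression, which sums log10(1 + 1/(10k + d)) over the possible first digits k = 1, ..., 9. The sketch below assumes this standard form and also shows how entries below 10 can be skipped, as described above. The function names are illustrative.

```python
import math


def benford_second_digit_probs():
    """Return {d: P(D2 = d)} for d = 0..9 under the Newcomb-Benford law."""
    return {
        d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    }


def second_digit(n):
    """Return the second digit of an integer count, or None if n < 10."""
    if n < 10:
        return None  # entries without a second digit are ignored
    return int(str(n)[1])


probs = benford_second_digit_probs()
for digit, p in probs.items():
    print(digit, round(p, 4))
```

The second-digit distribution is much flatter than the first-digit one, so a given sample size gives the test less power to detect departures from it.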
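The explanation above can be illustrated with two synthetic series. This sketch makes strong simplifying assumptions and does not use the Worldometer data: `growing` is an exponentially growing count spanning three orders of magnitude, `plateau` rises quickly and then saturates near 200, and all names and parameter values are hypothetical.

```python
import math


def first_digit_chi2(values):
    """Chi-squared statistic of the first digits of positive counts vs NBL."""
    counts = [0] * 9
    for v in values:
        if v > 0:
            counts[int(str(v)[0]) - 1] += 1
    n = sum(counts)
    chi2 = 0.0
    for d, observed in enumerate(counts, start=1):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (observed - expected) ** 2 / expected
    return chi2


DAYS = 270  # roughly March to November, used here only for illustration

# Steady exponential growth from 10 to just under 10,000.
growing = [int(10 ** (1 + 3 * t / DAYS)) for t in range(DAYS)]

# Growth that slows down and stagnates below 200, so the first digit sticks at 1.
plateau = [int(200 * (1 - math.exp(-t / 30))) for t in range(1, DAYS + 1)]

print("growing:", round(first_digit_chi2(growing), 2))
print("plateau:", round(first_digit_chi2(plateau), 2))
```

Under these assumptions the growing series stays below the 15.507 threshold, while the saturating series exceeds it by a wide margin because most of its values share the same leading digit.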

Discussion
The distributions of daily infection and death cases in Indonesia followed NBL for both the first-digit and second-digit statistics. It can be inferred that the Indonesian Government provided trustworthy data. However, from another point of view, the trend might subtly imply that the control interventions were not successful. This result echoes the findings of the study conducted by Suraya et al. [5].
Conversely, in Malaysia the distributions clearly did not obey NBL. It can be inferred that the process did not occur naturally. The reason could be either less accurate data reported by the authorities or the success of the control interventions.
The latter is the more plausible explanation, as demonstrated and emphasized by Amiruzzaman et al. [21]. This finding is also supported by the statistics, which showed an increase in NBL conformity during the period when the MCO was loosened.
This study evaluated the COVID-19 data of two Southeast Asian countries. We also applied the second-digit NBL test to confirm the trend obtained from the first-digit NBL distribution. Our results therefore support and complement the findings of other studies, which claim that the control interventions in Indonesia have not been shown to be effective and that the MCO in Malaysia, imposed nationwide during the first few months of the outbreak, was effective in containing the spread of COVID-19.
We note that this study did not focus on the mathematical modeling of epidemic growth. The analysis was based entirely on the data available on Worldometer, which reflect the data published by Government authorities.

Conclusions
We

Ethics approval and consent to participate
This research did not involve human participants or human data.

Consent for publication
Not applicable.

Availability of data and materials
The data were taken from the Worldometer website.

Competing interests
The authors declare that they have no competing interests.