The available literature on SARS-CoV-2 reinfection risk suggests that reinfection, although rare, is possible, but available estimates vary considerably and are mostly based on case reports, case series and a few cohort studies. We retrospectively analyzed a large repository of complete data for a single province, with the assistance of big data analysis, identifying more than 35,000 subjects with an initial positive result for SARS-CoV-2 registered over an 18-month period, and with an extended analysis covering at least 6 months of follow-up. We report 1,258 cases of SARS-CoV-2 reinfection (3.5%), compared with a total of 1,705 cases of reinfection reported among the SRs included in our SR overview. Despite the inherent limitations of big data analysis, our observation suggests that its application to a complete database of reported infections for a single population can estimate the true rate of SARS-CoV-2 reinfection more accurately.
Multiple questions regarding SARS-CoV-2 reinfection remain open. What is the pathophysiological mechanism of reinfection, which subjects are at higher risk of reinfection, and what is the clinical burden for reinfected patients? Reinfection with SARS-CoV-2 can be attributed mainly to two phenomena: decay of the immune response and viral mutations that favor the appearance of new variants. The decay or failure of naturally acquired immunity may result in reinfection with the same virus strain, whilst viral mutations may make subjects vulnerable to reinfection [26, 27]. New virus variants could evade the immune responses that subjects acquired through infection with previous variants, or reduce the neutralization capacity of polyclonal antibodies [26, 28]. This highlights the need to increase current knowledge about the degree of protection acquired against SARS-CoV-2, in order to guide the development of vaccines and the design and implementation of appropriate intervention strategies.
Current evidence confirms that patients infected by SARS-CoV-2 produce antibodies against the Spike and N proteins within 30 days of infection, but the mechanisms that mediate immunity are not fully understood. Infection by SARS-CoV-2 activates T and B cells, leading to the production of neutralizing antibodies that inhibit viral infectivity through various mechanisms of action, including blocking the binding of the Spike protein to the ACE2 receptor. IgM appears quickly but has a very short half-life. Specific IgG develops a few days after IgM and can be detected in serum about 7–14 days after symptom onset. A recent systematic review reported differences in the presence of antibodies during the first infection (56%) and reinfection (63%), suggesting that waning antibodies could place individuals at risk of reinfection. The presence of antibodies could play a protective role, but it does not by itself prevent reinfection. Furthermore, it has also been suggested that a previous COVID-19 infection may not confer total immunity, paving the way for a potential second infection by a different variant, with the second infection being potentially more severe than the first.
Currently, the rates of reinfection reported in SRs are discordant (ranging from 0% to 50%), which could partially be explained by the heterogeneous definitions of reinfection adopted. There is still no universal agreement on the time period that must separate two positive results for SARS-CoV-2 for the second to be defined as a reinfection, although the definition provided by the CDC is the most widely accepted. Further, most SRs mainly include case series or case reports, with limited examples of reinfection. Our big data analysis was conducted in a unique environment of complete data stored in a single warehouse, including all SARS-CoV-2 PCR testing in a single province, and analyzed according to the most commonly accepted definition of reinfection in the literature. With the collection of a large number of reinfected cases, it becomes possible to evaluate possible causes, information important for discriminating reinfection from recurrence, and the characteristics of subjects at higher risk of reinfection.
It has been pointed out that the severity of reinfection depends on the individual immune response, as well as on both the viral load and the SARS-CoV-2 variant causing the reinfection. A reinfection can therefore be of the same or greater intensity, and it is probable that it is mostly due to a new variant of the virus. Garduno-Orte et al. described 4 cases of reinfected healthcare workers, showing that in two cases the reinfection resulted in more severe disease. Likewise, Massachi et al. reported that 41% of reinfected patients experienced a greater symptom burden than during the initial infection. Conversely, Wang et al. reported that, with a second episode, 69% had similar severity, 19% had worse symptoms, and 12% had milder symptoms. Our study does not include information regarding disease severity, which makes it impossible to examine the clinical and social implications of SARS-CoV-2 reinfection.
Our study does, however, include subjects’ vaccination status, allowing important considerations about the protection against reinfection provided by natural immunity and by vaccines. In this study, the rate of reinfection among vaccinated subjects was lower than that observed among non-vaccinated subjects. If antibody decay is associated with susceptibility to reinfection, we may observe further reinfections over the coming months. Likewise, if the vaccine-induced immune response decays in the same way as the natural immune response, the need for booster immunization will have to be re-evaluated to maintain ongoing protection.
Our work is an example of the application of big data analysis in a laboratory setting, which makes it possible to obtain real estimates of the incidence of reinfection, to identify factors affecting reinfection, such as strains of the virus or patient immune characteristics, and to weigh the role of the vaccine in this pandemic. The Big Data concept refers to the complex analysis of a huge set of data, which requires the use of dedicated analytical and statistical approaches. The method uses advanced computational methods to extract information from datasets and build new association models. There are multiple sources of data (administrative databases, electronic health records, epidemiological studies), so it is important to develop an appropriate integration and analysis system that translates the information from the analysed data into appropriate clinical decisions.
Growing data availability and greater analytical capacity can improve results not only in the economic and financial fields but also in public health, by supporting diagnostic pathways, developing prognostic predictive models of disease and personalizing therapeutic regimens; they can also find an application in prevention initiatives [37, 38]. The application of Big Data analysis in healthcare has numerous advantages, as it enables: (i) the integration of different datasets and the construction of different algorithms and more complex learning models to find new genetic, biological and clinical associations, (ii) direct analyses of data from an entire population, overcoming the limitations of statistical approaches that analyze a representative sample to make inferences about a population (even if randomized controlled trials remain the "gold standard" for studying treatment efficacy), and (iii) the observation of the effects of long-term treatments. However, this approach has limitations due to the high variability of data and of data collection methodologies. These limitations can be mitigated by the use of adequate computation systems, thereby helping to reduce bias and make the data more usable.
In interpreting our results, some limitations should be considered: the lack of information about the symptoms and immunological status of the subjects analyzed, and about the viral strain causing infection and reinfection. Defining reinfection as a positive result 90 days after an initial infection, as done in this analysis, cannot exclude a possible reactivation of a latent infection. Further, the true rate of infection is assumed to be underestimated, as many asymptomatic subjects are not tested for viral RNA and, among those tested, genomic sequencing is not always performed, which makes it very difficult to identify the precise variants causing infection and reinfection. A third factor to consider in the evaluation of reinfection is a potential false negative molecular result at discharge, with a subsequent positive result being due to persistent infection.
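As an illustration of the time-based definition discussed above, the following minimal Python sketch flags a subject as reinfected when a positive result occurs at least 90 days after that subject's first positive result. It is not the pipeline used in this study: the record layout, the function names `flag_reinfections` and `reinfection_rate`, and the choice to treat the threshold as "at least 90 days" are assumptions made for illustration only.

```python
from datetime import date, timedelta

# Assumption: a positive result at least 90 days after the first
# positive result is counted as a reinfection.
REINFECTION_GAP = timedelta(days=90)


def flag_reinfections(positives):
    """Return {subject_id: date of the first qualifying later positive}.

    `positives` is an iterable of (subject_id, test_date) pairs, one per
    positive PCR result (hypothetical record layout).
    """
    by_subject = {}
    for subject_id, test_date in positives:
        by_subject.setdefault(subject_id, []).append(test_date)

    reinfected = {}
    for subject_id, dates in by_subject.items():
        dates.sort()
        first_positive = dates[0]
        for later in dates[1:]:
            # Positives inside the window are treated as the same episode.
            if later - first_positive >= REINFECTION_GAP:
                reinfected[subject_id] = later
                break
    return reinfected


def reinfection_rate(positives):
    """Share of subjects with an initial positive who were later reinfected."""
    positives = list(positives)
    subjects = {subject_id for subject_id, _ in positives}
    if not subjects:
        return 0.0
    return len(flag_reinfections(positives)) / len(subjects)
```

A time-only rule such as this one cannot separate true reinfection from reactivation or persistent infection; doing so would require sequencing data, which this sketch does not model.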
Our big data analysis of a complete population confirms an overall risk of reinfection by SARS-CoV-2 of 3.5%, with unvaccinated and younger subjects being more susceptible to reinfection. More data will become available over time, and big data analysis will enable their timely integration into the design of targeted strategies to control and prevent reinfection, adding value to the patient's care pathway and supporting healthcare systems. In the meantime, social distancing, the use of masks and hand hygiene remain the main preventive measures against primary infection and reinfection with SARS-CoV-2. A standardized approach to identifying and reporting reinfection cases is necessary.