Ethnicity-dependent allele frequencies are correlated with COVID-19 case fatality rate

Coronavirus disease (COVID-19), caused by SARS-CoV-2, has a higher case fatality rate (CFR) in European ethnic groups than in others, especially East Asians. One explanation for this phenomenon might be TMPRSS2, a key processing enzyme essential for viral infection. Here, we analyzed the allele frequencies of two nonsynonymous variants, rs12329760 (V197M) and rs75603675 (G8V), in the TMPRSS2 gene using over 200,000 present-day and ancient genomic samples. We found a significant association between the CFR of COVID-19 and the allele frequencies of the two variants. Interestingly, they had opposing effects on the CFR: V197M was inversely correlated with it, whereas G8V was positively correlated. East Asians have higher V197M and lower G8V allele frequencies than Europeans, possibly conferring resistance against SARS-CoV-2. Structural and energy calculation analysis of the V197M amino acid change showed that it destabilizes the TMPRSS2 protein, possibly affecting its ACE2 and viral spike protein processing negatively, ultimately resulting in reduced SARS-CoV-2 infection efficiency and CFR in East Asian ethnic groups.


Introduction
Coronavirus disease (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). First appearing in late 2019 in Wuhan, China, COVID-19 has spread rapidly worldwide 1. As of May 23, 2020, SARS-CoV-2 has infected >5 million people in over 200 countries, killing more than 330,000 people 2. Europe has been particularly affected, with Spain and Italy each reaching over 200,000 cases of infection and more than 27,000 deaths, resulting in a maximum case fatality rate (CFR) of >10% 2. In contrast, East Asia did not experience such dire effects, with South Korea, for instance, reporting a peak CFR of 2.4% 2. Multiple contributing factors could explain this difference, including timing and severity of lockdown measures 3, population age ratio 4, healthcare resource availability 5, smoking rate 6,7, and early tuberculosis (Bacillus Calmette-Guérin) vaccination [8][9][10].
Genes encoding cellular serine protease (TMPRSS2), angiotensin-converting enzyme 2 (ACE2), cysteine proteases cathepsin B and cathepsin L (CatB, CatL), phosphatidylinositol 3-phosphate 5-kinase (PIKfyve), and two pore channel subtype 2 (TPC2) are notable for their critical roles in SARS-CoV-2 infection 14,15. In particular, the virus uses the proteolytic activity of TMPRSS2 and CatB/L to prime the viral spike protein, whereas ACE2 is the entry receptor for breaking into host cells 14,15. A study has suggested TMPRSS2 inhibition as a clinical target because the priming step is a key factor determining successful entry into target cells 15. Not only do TMPRSS2 variants appear to have wide population-specific variation 16, but TMPRSS2 also has a low mutation burden in certain populations, a characteristic that could partially explain high TMPRSS2 gene expression. High expression, in turn, is associated with a poor outcome in COVID-19 16.
To understand the genetic background of complex phenotypes in human populations, researchers commonly assess correlations with allele frequency (AF) 16,17. This approach has identified a correlation between ancestral genetic composition and the CFR of COVID-19 17. However, few have examined specific variants, their frequencies and individual contributions to SARS-CoV-2 susceptibility. Some reports are also based only on low-resolution intercontinental comparisons between Europeans and East Asians [16][17][18]. Moreover, we know little about the evolutionary history of SARS-CoV-2 susceptibility-associated variants, including when they occurred or how their frequencies might have changed over time.
In this study, we investigated intercountry AF differences of TMPRSS2 variants, estimated variant effects on TMPRSS2 protein structure stability, and linked them to the average of time-adjusted COVID-19 CFR (AT-CFR). We propose that the structural deviation causes TMPRSS2 to be less stable, resulting in a reduced overall infection rate that led to reduced CFR in East Asians. We collected and analyzed 221,498 genomes from public databases 19-21 and 2,262 whole genomes from the Korean Genome Project 22 . We also traced TMPRSS2 AF distribution in ancient populations by region and time period. We aimed to increase the current understanding of the genetic variation underlying SARS-CoV-2 infections and explain the ethnic differences in CFR.
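As a point of reference for the AF comparisons, the following minimal sketch shows how an alternate-allele frequency can be derived from genotype counts at a diploid autosomal site (TMPRSS2 lies on chromosome 21). The function name, population labels, and counts are hypothetical placeholders for illustration; they are not taken from the databases used in this study, which typically report AF directly.

```python
def alt_allele_frequency(hom_ref, het, hom_alt):
    """Alternate-allele frequency at a diploid autosomal site.

    Each individual carries two alleles, so heterozygotes contribute one
    alternate allele and alternate homozygotes contribute two.
    """
    n_individuals = hom_ref + het + hom_alt
    if n_individuals == 0:
        raise ValueError("no genotyped individuals")
    return (het + 2 * hom_alt) / (2 * n_individuals)


# Hypothetical genotype counts (hom_ref, het, hom_alt); placeholders only.
genotype_counts = {
    "population_A": (810, 180, 10),
    "population_B": (360, 480, 160),
}

allele_freqs = {
    name: alt_allele_frequency(*counts)
    for name, counts in genotype_counts.items()
}

for name, af in allele_freqs.items():
    print(f"{name}: AF = {af:.3f}")
```

Per-country AF values obtained in this way are the quantities that would then be compared against AT-CFR.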
Correlation between TMPRSS2 V197M allele frequency and COVID-19 AT-CFR
The AF of V197M was negatively correlated with COVID-19 AT-CFR (Spearman's correlation coefficient, ρ = -0.464, P = 0.0157) (Fig. 1a). The AF distribution pattern was consistent with previous reports, with V197M AF being significantly lower in most Europeans than in East Asians 16 (Fig. 1a and

TMPRSS2 V197M variant in ancient genomes
The V197M variant is absent in the great apes 26,27 and in all sequenced archaic hominin genomes (Denisovan, Neanderthal). However, Tianyuan man's genotype showed that the variant was already

Effect of V197M variant on TMPRSS2 protein structure
We used 3D protein models to investigate the effect of V197M on TMPRSS2. The V197M models had higher energy scores (dDFIRE 29, nDOPE 30) than the wild type (Table 1), suggesting reduced stability. We used two homology modeling tools (Robetta 31, I-TASSER 32) (Fig. 2) and transmembrane serine protease hepsin (PDB ID 1Z8G chain A) 33 as the template (Supplementary Fig. 6). The resultant model contains both the SRCR domain and the nearby peptidase S1 domain of TMPRSS2 (Fig. 2) because the former was too small for modeling. Despite only minor structural changes to the SRCR domain (Fig. 2), V197M had a consistently destabilizing effect on TMPRSS2 (Table 1). A further indication of reduced stability in the mutant models was a decrease in the proportion of residues in the favored region of the Ramachandran plot. Seven computational protein-stability prediction tools confirmed the V197M variant as destabilizing (Supplementary Data 8).
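The comparison of energy scores can be illustrated with a short sketch. It assumes, as is conventional for dDFIRE and nDOPE, that lower scores indicate a more favorable (more stable) model, so a positive mutant-minus-wild-type difference points toward destabilization. All numbers below are hypothetical placeholders, not the values in Table 1, and the function names are ours.

```python
def score_shifts(wild_type, mutant):
    """Mutant-minus-wild-type difference for each (tool, score) pair."""
    return {key: mutant[key] - wild_type[key] for key in wild_type}


def consistently_destabilizing(shifts):
    """True if every energy score is higher (less favorable) in the mutant."""
    return all(delta > 0 for delta in shifts.values())


# Hypothetical placeholder scores keyed by (modeling tool, energy function).
wild_type_scores = {
    ("Robetta", "dDFIRE"): -700.0,
    ("Robetta", "nDOPE"): -1.20,
    ("I-TASSER", "dDFIRE"): -650.0,
    ("I-TASSER", "nDOPE"): -0.90,
}
v197m_scores = {
    ("Robetta", "dDFIRE"): -690.0,
    ("Robetta", "nDOPE"): -1.10,
    ("I-TASSER", "dDFIRE"): -645.0,
    ("I-TASSER", "nDOPE"): -0.85,
}

shifts = score_shifts(wild_type_scores, v197m_scores)
for (tool, score_name), delta in shifts.items():
    print(f"{tool} {score_name}: shift = {delta:+.2f}")
print("consistently destabilizing:", consistently_destabilizing(shifts))
```

Requiring agreement across every tool and energy function is a deliberately strict criterion; a single discordant score would flip the verdict.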

Discussion
This study has limitations. First, we only used public genome databases and variant frequency data that are not directly linked to COVID-19 patients and CFR. Furthermore, we could not completely normalize AT-CFR with relevant covariates, such as lockdown measures, mask availability, medical care standards, within-population or within-fatal-case age ratios, and SARS-CoV-2 test availability. However, we tested Spearman's correlation between AT-CFR and thirteen socio-economic variables, such as population density and Gross Domestic Product (GDP) per capita, in a pairwise manner and found that only the proportion of the elderly (65 years and older) and the proportion of female smokers had significantly positive correlations (Supplementary Fig. 7). Another limitation is the lack of variant frequency data on chromosome X, which is absent from many public databases such as PGG.SNV, even though the X chromosome contains a key player, ACE2 14,15. Notably, our protein structure modeling showed that TMPRSS2 and the template had a low sequence identity (32.49%). However, we confirmed that the V197M variant region of SRCR remained extremely consistent (Supplementary Fig. 6). Furthermore, ancient G8V data relied on sparse whole-genome-sequencing resources originating mainly from Europe and Russia, dated 2,000-1 BCE (Supplementary Data 9); these data sets proved too small to be conclusive. Finally, base-calling processing biases (e.g., haploidized ancient genome sequences) are a distinct possibility.
A previous report has noted that Europeans have significantly lower V197M AF than East Asians, a pattern speculated to be associated with COVID-19 CFR 16. In contrast, G8V has not been linked to ethnicity-relevant SARS-CoV-2 susceptibility or COVID-19 outcomes until this study. Although we observed a significant correlation between the AFs of these two TMPRSS2 variants and AT-CFR (Fig. 1), correlations between AFs and infection cases (per million) were non-significant (Spearman's correlation V197M: P = 0.132; G8V: P = 0.165) (Supplementary Fig. 1). One likely explanation is that infection cases are a more complex parameter than CFR. Alternatively, CFR in infectious diseases may reflect the importance of genetic factors more than infection rate 35. To verify this hypothesis, however, we require further studies investigating genomes, infection, treatment, and CFR data of COVID-19 patients.
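For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation of the ranks of two variables. The sketch below is a dependency-free illustration of that definition, not the analysis code used here; the per-country values are invented placeholders and the function names are ours. It does not compute a P value; in practice a statistics library such as SciPy (scipy.stats.spearmanr) reports both ρ and P.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Invented placeholder data: one AF and one AT-CFR value per country.
allele_freq = [0.10, 0.20, 0.30, 0.40, 0.50]
at_cfr = [9.0, 7.5, 8.0, 3.0, 2.0]

rho = spearman_rho(allele_freq, at_cfr)
print(f"Spearman's rho = {rho:.3f}")
```

Because only ranks enter the calculation, the statistic is insensitive to monotonic rescaling of either variable, which suits quantities such as CFR that are measured on very different scales across countries.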
Our evaluation of protein structural stability predicted that V197M destabilizes TMPRSS2 (Table 1, Supplementary Data 8). Unfortunately, we could not perform the same analysis on G8V because we lacked a homology modeling template. Our evidence (evolutionary conservation, protein domains) is insufficient to ascertain that G8V significantly affects TMPRSS2 protein structure and overall SARS-CoV-2 infection. However, one report has indicated that G8V affects residue torsion angles 36. The resultant reduction in flexibility is likely to affect TMPRSS2 interactions with ACE2 and the SARS-CoV-2 spike protein 36. We suspect that the V197M and G8V variants are related to overall TMPRSS2 gene expression, but we could not validate this.
In line with previous reports, we suggest that V197M acts to indirectly compromise the binding affinity of TMPRSS2 to SARS-CoV-2 spike protein and ACE2 [35][36][37]. This implies a protective role of the V197M variant against SARS-CoV-2 infections, but neither we nor previous researchers 36
Where N is the number of days which showed < 2,500 confirmed cases in each country, a_n is the weight of T-CFR on day n, T-CFR_n is T-CFR on day n, and c_i is the number of confirmed cases on day i.
TMPRSS2 protein structure of both wild type and mutant type (V197M), predicted with homology modeling using hepsin (1Z8G) template from the PDB database.