Development of A Method for the Detection of Antagonism in the Microbial Community and A Toolbox for Identical Sequence-Level Analysis


 Competition or antagonism is common in microbial interactions. Therefore, computationally identifying those relationships in the microbial community is in demand. Here, we defined exclusive correlation value (ECV) to better detect mutually exclusive relationships during microbiota analysis. We present METATEA (https://sourceforge.net/projects/metatea/) as a handy toolbox for the calculation of ECV and the analysis of community members with identical 16S rDNA sequences to define intraspecies groups. We analyzed community data related to inflammatory bowel disease to validate the utility of our ECV and METATEA. ECV identified strains antagonistic to disease-associated bacteria, thus providing their potential as probiotics in precision medicine. Klebsiella pneumoniae showed exclusive relationships with various bacteria, and its proportion was associated with microbial diversity and Crohn’s disease. Further, many putative pathobionts were detected using METATEA, some of which were reported to be isolated from other body sites and cause illness to the host.


Introduction
Complicated and intricate networks of interactions govern microbial communities in an ecosystem. They are more often competitive or antagonistic than mutualistic. Understanding these relationships is important not only for basic understanding of the inner workings of a given microbial ecosystem, but for biotechnological application to medicine, agriculture 1 , and environments. Various computational and statistical methods have been developed to assess microbial relationships. For example, Pearson correlation coe cient (PCC) 2 , Spearman's rank-order correlation coe cient (SCC) 3 , or maximal information coe cient (MIC) 4 had been commonly adopted in most of the previous studies to analyze relationships across microbes. However, the methods are optimized for synergistic relationships between microbes and variables with normal distribution. PCC is not optimized for non-linear relationships. In SCC, weak mutually exclusive relationships, in which the quantity of one variable negatively affects the other variable, can be detected, but non-coexisting relationships, in which the existence of one variable affects the other variable, cannot be distinguished. Although a variety of correlations can be detected using MIC, scientists have not been able to selectively distinguish exclusive relationships from others. New approaches to better detect strong mutually exclusive (almost non-coexisting) relationships across microbial members are desirable. Further, exclusive relationships may not be fully indicated by rankbased measures due to exponential growth of microbes. A convenient method to investigate exclusive relationships could enhance the development of probiotic products. For example, certain probiotic bacteria are considered to reduce intestinal colonization of pathogens 5,6 . Additionally, detection of exclusive relationships with harmful microbes will lead us to precision medicine and reduced adverse effects of probiotic agents.
Several tools such as mothur 7 and QIIME 2 8 are available for scrutinizing microbial communities in environment. Microbial members in the community are most often analyzed at the operational taxonomic unit (OTU) level that is roughly equivalent to the genus level. However, microbes in the same OTU may have different ecological interactions and cellular functions 9 . For example, while Escherichia coli is an important member of the normal gut microbiota, some serotypes are pathogenic 10 . Therefore, 16S analysis below the species levels can aid better understanding of complex diseases, such as in ammatory bowel disease (IBD) that can be divided into two major types, Crohn's disease (CD) and ulcerative colitis (UC), and other forms including indeterminate colitis (IC), and help detect speci c bacterial groups associated with the disease type. Thus, assignment of amplicon sequence variants instead of OTUs has been implemented in QIIME 2 8 .
In this study, we developed METATEA (METAgenomics Toolbox for E cient Analysis of microbial communities), a toolbox that enables microbiota analyses at the identical sequence level and evaluates mutually exclusive relationships between microbial members by calculating exclusive correlation value (ECV). METATEA supports various functions, such as basic quality control processes, calculation of sequence proportions, calculation of ECV, statistical analyses, etc. Output le formats were designed to easily perform network analyses and statistical tests using other tools, such as R or MATLAB. In this study, 428 IBD samples, derived from a previous microbiota study 11 , were analyzed for performance assessment of METATEA and validation of ECV (Fig. 1). We presented the differences between analysis at sequence level and that at higher taxonomic level, bacterial sequences associated with disease status, and the relationship between Klebsiella pneumoniae and bacterial diversity.

Results
Introducing METATEA. METATEA aims to be a toolbox for easy analyses of the microbial community using amplicon sequence data (and whole genome sequence data in the future) at the sequence level and for the detection of antagonistic relationships. Since intraspecies variation in even one gene can change the virulence of a pathogen 12 , we developed pipelines for analysis at the sequence level (Fig. 2).
Analyzing 428 IBD samples as a case, we had challenges from sequencing errors. Sequencing errors may increase the number of unique sequences, which are weakly proportional to the total numbers of reads (Supplementary Table S1); we assumed such erroneous reads to generally occur once or twice in a sequence le. Therefore, we developed ltering step and normalization step for the removal of rare sequences and the adjustment. Also, various quality control steps were developed to accurately handle reads with Ns and short/long reads at the sequence level. For example, most ambiguous bases (Ns) from the pyrosequencing platform can be corrected 13 using METATEA. For the detection of microbial antagonism, METATEA implements the calculation of ECV (see Methods). Being more user-friendly, METATEA supports matrix output format. The format was designed to easily perform further analyses using other statistical tools like R. METATEA is written in Perl, and we support both Windows-compatible and Linux-compatible executables. METATEA Linux version was compiled from C code for relatively fast analyses of large data sets. METATEA is free, and its detailed manual is available from the project website (https://sourceforge.net/projects/metatea/). Comparison between taxon-based and sequence-based analyses. We determined the proportional threshold (≥ 0.1%) and extracted 3,202 unique sequences and00 calculated their proportions. To validate our method, the microbial community structure was compared to that in the previous study 11 , from which our data set was derived. Although the previous study had used different methods and samples relative to our current one, the regression line in Supplementary Fig. S1 showed our present results to be in accordance with the previous one's with respect to community composition (r 2 = 0.989 in HC and 0.973 in CD).
Generally, analysis results at the sequence level agreed with, and even outperformed, the results at family/genus/OTU levels. First, increase/decrease of bacteria, according to the disease subtypes, at the sequence level generally matched the results of a previous study 11 at the genus or family level ( Supplementary Fig. S2). For example, ve bacterial sequences in the Pasteurellaceae family increased in CD, as reported in a previous study 11 . However, Bacteroidales is comprised of various bacteria, whose increase was offset by its decrease. Second, the risk of CD was predicted using random forests at both sequence and OTU (1% and 3% dissimilarity) levels. Prediction at the sequence level outperformed that at the OTU level in terms of the area under the receiver operating characteristic (ROC) curve (AUC) shown in Fig. 3. Third, the co-occurrence of CD-associated organisms in the previous study generally coincided with the correlation network in this study ( Supplementary Fig. S3), despite the use of different samples and methods. For example, both sequence identi ers S0276 (Fusobacterium periodonticum) and S0345 (Veillonella dispar) were increased in CD and showed possible agreement (Supplementary Table S2). However, we could also discover a few exceptions, like S0066 (Clostridium symbiosum; Lachnospiraceae) and S0346, which were increased in CD and showed possible agreement, although Lachnospiraceae was co-excluded in CD in the previous study 11 .
Analyses of the microbial community composition. The most frequent sequence identi ers in our IBD data set were S0078 (Bacteroides vulgatus), S0052 (Bacteroides fragilis), S0029 (E. coli), S0051 (Faecalibacterium prausnitzii-like), and S0152 (Bacteroides dorei) in descending order (Supplementary Table S2). The microbial sequences were the main coe cients in our PCA analysis ( Supplementary Fig.   S4a). Generally, HC samples clustered in the center, slightly toward sequence S0078, in the PCA analysis. PCA results coincided with the heat map results ( Supplementary Fig. S4b). Single bacterial sequences did not show high proportions in HC samples (≥ 50%), whereas certain CD and UC samples did show high proportions (≥ 60%) of sequence S0029 or S0046. Bacterial community structures in CD (pMANOVA: P < 0.001) and UC (pMANOVA: P < 0.001) were signi cantly different from that in HC. Several intraspecies variations in disease subtypes were identi ed, and their average proportions changed differently according to disease subtypes (Supplementary Table S2). Particularly, intraspecies variation of B. vulgatus, E. coli, F. prausnitzii, Parabacteroides distasonis, and Kineothrix alysoides in average proportion, according to disease subtypes, were prevalent, whereas variation of B. dorei, Haemophilus parain uenzae, Prevotella copri, and Fusobacterium nucleatum were less obvious ( Supplementary Fig.  S5).
Detection of disease-associated community members. Multiple Mann-Whitney U tests were performed using METATEA to identify sequences associated with disease subtypes (Supplementary Table S3). Generally, H. parain uenzae, Dialister pneumosintes, F. periodonticum, and many other sequences were signi cantly increased in CD. H. parain uenzae-like, F. prausnitzii-like, Sutterella massiliensis-like, and many others were signi cantly increased in UC. We newly discovered Veillonella atypica, Pseudomonas lini, and Hyphomicrobium zavarzinii to possibly be related to CD, IC, and UC, respectively, probably due to high-resolution analyses at the sequence level. Lachnospiraceae showed variations in disease subtypes ( Supplementary Fig. S6). Intraspecies variations in Z scores were apparent, particularly among sequences of low proportions (Supplementary Table S2). For example, although F. prausnitzii (S0376) and some F. prausnitzii-like sequences (S0051) were increased in HC, other F. prausnitzii-like sequences, such as S0397 and S2816, were increased in CD and UC, respectively (Supplementary Table S3). Generally, sequences with high Z scores (≥ 2.3264) in CD and UC compared to those in HC might be those of bacteria associated with various sites of the body or opportunistic environmental bacteria, rather than members of the normal microbiota in the gut (Supplementary Table S4). Some of these aberrant bacteria of putatively external origins are reportedly related to various diseases like endocarditis 14 , carcinoma 15 , and abscess 16 . For more accurate prediction of CD risk, using more biomarkers at the identical sequence level may be an appropriate approach rather than using a small number of powerful biomarkers. For example, sequences like S0362 (Roseburia inulinivorans) can be the most powerful biomarkers to predict CD risk (Fig. 3b). However, even the highest proportion of S0362 cannot con rm HC status ( Supplementary Fig. S7).
Comparison between ECV, PCC, and SCC. ECV was designed to selectively identify highly exclusive relationships among all microbial pairs (Fig. 4). This ability is very important in terms of microbial data because most microbial pairs can show widely scattered inversely proportional correlations as shown in Fig. 5 around the median values of ECV, PCC, and SCC. ECV successfully identi ed strong antagonistic (almost non-coexisting) relationships between K. pneumoniae (S0390) and F. prausnitzii-like (S0764, S1650, and S25900) sequences around the maximum values (Fig. 5a). Although the negative relationships between K. pneumoniae and F. prausnitzii-like sequences are to be experimentally veri ed, K. pneumoniae increased in disease samples, whereas F. prausnitzii increased in control samples in previous studies 17,18 .
We then compared ECV, PCC, and SCC for their capabilities to identify antagonism. Widely used PCC is one of the robust means to detect positive and negative linear correlations. Although PCC nds well cooccurring microbial members having positive correlations at the maximum values (~ 1), strong negative relationships (~-1) were undetectable using PCC (Fig. 5b). Among the microbial pairs (B. vulgatus and B. dorei as well as closely related sequences) with extremely low correlation coe cients, none had the coe cients lower than − 0.3 (Supplementary Table S5). SCC assesses monotonic relationships using a rank-based measure (whether linear or not), while PCC assesses linear correlations. SCC identi ed microbial pairs with inversely proportional correlations better around the minimum values than PCC (Fig. 5c). However, strong antagonistic relationships could not be identi ed even with SCC. For example, S0535 (Haemophilus pittmaniae) and S1732 (E. coli-like) showed strong antagonistic relationships, and their SCC was 0.039. Pairs (Lachnospiraceae and Oscillibacter ruminantium-like; Blautia wexlerae and Escherichia coli-like; Bacteroides uniformis and Escherichia coli-like) with correlation coe cients from the minimum value were − 0.367, -0.346, and − 0.336.
Relationships between community members. Relationships across different sequences of the same species may be associated with disease subtypes and be useful for predicting CD risk. For example, IBD samples were categorized into two groups, as shown in Supplementary Fig. 8a, and high ratio of S0103 (B. vulgatus-like) to S0092 (B. vulgatus-like) might be related to CD. We successfully detected mutually exclusive relationships using ECV at three different OTU levels (0, 1, and 3% dissimilarity); S0390 (K. pneumoniae), 1%OTU281, and 3%OTU010 generally had signi cantly exclusive relationships with a variety of bacteria (Supplementary Table S6-S8; Supplementary Fig. S9) and increased in CD than in HC (Fig. 6). Particularly, high proportions of S0390 might be related to reduced bacterial diversity ( Supplementary Fig. S10). Different microbes showed different ECV values with certain diseaseassociated microbes. We regarded 15 sequences (average proportion ≥ 0.1%) as putative probiotics that occurred at least three times in HC than in CD, and 8 sequences were removed by the ECV formula (see Methods), The 7 putative probiotic sequences, S0342 (F. prausnitzii), S0362 (R. inulinivorans), S0463 (Blautia luti), S0464 (Clostridium amygdalinum), S0661 (Lachnospiracea_incertae_sedis), S0706 (Lachnospiracea_incertae_sedis), and S0707 (unclassi ed_Lachnospiraceae) in general showed exclusive relationships with disease-associated sequences (Supplementary Table S9). Particularly S0342 showed strong mutually exclusive relationships with E. coli whereas S0464 did with F. periodonticum (Supplementary Table S10). Additionally, ECV was designed to be less affected by sample sizes, and pairs of S0390 and other sequences showed higher ECV values in HC than in CD (Supplementary Table  S11). For example, ECV value of S0089 (Bacteroides koreensis-like) and S0390 in HC was higher than in CD, suggesting putative synergistic effects of S0390 whether it is direct or not ( Supplementary Fig. S8b). Additional studies using more clinical data would validate our observation.

Discussion
This study focused on the performance analyses of METATEA with identical sequence matching, and on ECV to identify strong mutually exclusive relationships between microbes. The high resolution of sequence-based analyses enabled us to detect missing disease-associated microbes within the same taxonomic unit. ECV identi ed strong mutually exclusive relationships between microbes, whereas PCC and SCC could not distinguish strong mutually exclusive relationships from weak ones in microbial data sets. Although intraspecies variations can be detected using QIIME 2 8 , we newly developed METATEA for the convenient calculation of ECV from microbial proportions. Also, METATEA can be easily installed regardless of the limitation of environment managers. The identical sequence matching process using METATEA can be performed without traditional time-consuming jobs, such as distance calculation and clustering compared to OTU-based tools. The e cient process will enable accurate diagnosis of diseases.
Analyzing microbial communities at the sequence level can have more merits than that at high taxonomic levels. We identi ed missing disease-associated microbes by the offset from the reduced strains within the same group. The high-resolution of bacterial populations at the sequence level enabled us to newly detect many disease-associated microbes, such as V. atypica, P. lini, and H. zavarzinii. It has been reported that IBD is linked to various diseases, such as carcinoma, endocarditis, and abscess [19][20][21] . Beyond our expectations, many of the speci c strains of disease-associated taxa were related to various sites of the body and lethal diseases, not only due to immune dysfunction 22 , but also due to various disease-associated microbes and probably due to evolutionary advantage for movement between body sites. Prediction at the sequence level outperformed that at the OTU level in terms of AUC. Also, novel relationships between strains within the same species could be identi ed.
The merits of analyzing microbial communities at the sequence level might stem from intraspecies variations. We identi ed intraspecies variation of B. vulgatus, E. coli, F. prausnitzii, P. distasonis, and K. alysoides. Especially, F. prausnitzii is reported as a potential novel probiotic bacterium for IBD although the bene cial mechanisms are still unclear 23 ; however, certain F. prausnitzii-like sequences, such as S0397 and S2814, were increased in CD and UC, respectively. Particularly, probiotics should be carefully studied at the sequence level to reduce possible side effects due to intraspecies variations. Strains within B. vulgatus species might have novel relationships: B. vulgatus-like sequences (S0092 and S0103) showed two different ratios. The high ratio of S0103 might activate NF-κB in human gut epithelial cells 24 , or in ammation of the gastrointestinal tract might result in the high ratio of S0103; more experimental studies would be required in this direction.
PCC, SCC, and ECV all e ciently detected different relationships between microbial members. Although PCC could detect positive linear correlations that correspond to co-occurrence, mutually exclusive relationships that correspond to reciprocal exclusion are poorly identi ed. SCC could exibly identify negative or positive correlations, but the strongest exclusive relationships could be missed. ECV could not exactly identify positive linear correlations, but ECV could detect the strongest exclusive relationships between microbial members. ECV can be useful for the development of precision probiotics, since probiotics may have antagonistic effects of different strengths on speci c pathogens. Custom probiotic supplements might be needed so as to prevent undesirable side effects. Additionally, ECV can be useful for understanding reduced microbial diversity and CD. Speci c sequence identi ers have strong mutually exclusive relationships with a variety of bacterial sequences and increase in CD. For example, high proportions of S0390, a sequence of K. pneumoniae, are signi cantly associated with decreased microbial diversity. K. pneumoniae is a lactose-fermenting bacterium that causes infections especially in alcoholic patients 25 , and ectopic colonization of Klebsiella spp. in the intestine drives T helper 1 cell induction and in ammation 26 . Intestinal dysbiosis and decreased microbial diversity caused by drinking alcohol are related to increased K. pneumonia 27 . Moreover, high-alcohol-producing K. pneumoniae can be associated endogenous alcohol production 28 . Whatever the exact reasons for reduced bacterial diversity and CD should be, K. pneumoniae may be associated with reduced microbial diversity and CD. Also, K. pneumoniae might have synergistic effects with other disease-associated microbes according to differences in ECV in disease subtypes.
The e cient analyses using METATEA and ECV will enable accurate risk prediction of complex diseases, identi cation of more disease-associated microbes and antagonistic relationships between microbes, indepth understanding of reduced microbial diversity, and development of precision probiotics to reduce undesirable side effects.

Materials And Methods
De nition of ECV. We de ned ECV to identify strong mutually exclusive relationships. ECV was designed to be useful for the practical detection of mutually exclusive relationships between bacteria, as much as possible, so as to exclude the effects of group size. First, we generated candidate formulae to investigate mutually exclusive relationships. A single basic formula was selected that would be less affected by group size due to the division by multiplication of microbial proportion. Then, we analyzed the data distribution and performed data visualization using public data sets and those produced in house to empirically modify the formula using the numbers of zero proportions. Let a and b be the proportions of two bacterial sequences, and n be the number of observed proportions. To remove rare bacterial sequences, we counted the number of proportions consisting of zeros, since rare bacterial sequences can be shown to be exclusive by chance because rare bacterial sequences sparsely coexist and the relationships of the sequences can be easily misunderstood as exclusive relationships. Let N(a), N(b), and N(ab) be the number of zeros in a, b, and both a and b. In this study, too rare sequences were removed by the left conditions mentioned below. However, METATEA users should apply right conditions for small sample sizes (< 20): Finally, ECV was de ned as is the main part to detect mutually exclusive relationships (Fig. 4a). (N(ab)+1) is supplementary part to lter out rare sequences, thereby removing false positives incurred by skewed distributions and outliers (Fig. 4b). (N(a)+1) and (N(b)+1) are supplementary parts to accurately estimate antagonism which were obtained after observing additional gastric microbiota and soil microbiota data sets (Fig. 4c).
Public data sets. METATEA is applicable to any 16S amplicon sequence data, and ECV is applicable to various elds of natural science. Particularly, we expect that our e cient and simple analysis pipeline, high-resolution detection of disease-associated microbes, and ECV for detection of antagonism against disease-associated microbes are suitable for analyzing gut microbiome data for clinicians. To convey the potential applications of METATEA to human gut microbiota data, we describe here the public data sets that were analyzed using METATEA. We selected 140 healthy control (HC), 214 CD, 20 IC, and 54 UC data sets in BioProject: PRJEB13679 that were relevant to the previous study 11 on IBD in order to develop sequence-supervised processes, assess the performance of METATEA, and validate ECV. Although the symptoms of IBD are heterogeneous 29 , the samples were classi ed into four subtypes based on clinical metadata: HC, CD, IC, and UC. Only the data sets of terminal ileum biopsy (Supplementary Table S1) were downloaded using SRA Toolkit (v.2.9.0), since the terminal ileum represents a transition zone to detect both the aerobic ora and anaerobic microbes 30 . Moreover, in CD risk prediction, terminal ileum outperformed other biopsy locations, such as rectum and stool 12 ; CD has been shown to frequently occur in the terminal ileum 31 .
Main processing. For objective analyses at the sequence level, abnormally short or long reads (< 175 bp or >175 bp) were removed. After BLAST analysis, 11 sequences were identi ed as human DNA sequences occurring frequently, and were used for the decontamination step using METATEA. After quality control processes, the sum of proportions of microbial members was adjusted to one. Heat map, box plots, and PCA analysis were prepared using MATLAB (R2020a). Mann-Whitney U tests were performed using METATEA to compare CD/IC/UC with HC, and the signi cantly increased disease-associated sequences were identi ed. To test for differences in microbial community structures between disease subtypes, pMANOVA tests were performed using the function 'adonis2' of the vegan package in R (v4.0.0). Weighted correlation network analysis and CD risk prediction were performed using R. To predict CD risk, 80% and 20% of data sets were randomly used for training and testing, respectively. Square roots of the sequence/OTU numbers were used to tune mtry parameters. We calculated OTU (1% and 3% dissimilarity) using Mothur (v.1.44.1) with SILVA recreated SEED database (v.138). Taxonomic classi cation was mainly performed using RDP classi er at family level. Taxonomic classi cation at species level was performed using web BLAST with 16S ribosomal RNA sequence database, and the BLAST results were con rmed by RDP classi er at genus/family level. If the BLAST results were inconsistent with RDP classi er results, we considered RDP results at genus level. In case of recent reclassi cation, we considered literature search results. Literature was surveyed to identify the relationships between IBD and other diseases, since oral bacteria are known to be related to IBD and various diseases 32 and IBD is related to various, often lethal, diseases such as cancer 33 , endocarditis 34 , and neurological disorders 35 . The maximum likelihood method was selected to construct a phylogenetic tree using MEGA-X (v10.0.5). The analysis pipeline for unsupervised analysis using a few samples is described in Fig. 2.
Comparison with a previous study and determination of a proportional threshold. To compare the results in this study to that in the previous study 11 , taxonomic analysis was performed using data from terminal ileum in Supplemental Table 3 of the previous study 11 . After quality control processes using METATEA, we assumed the sequencing errors to have increased the number of unique sequences, since the number of reads was almost proportional to the number of unique sequences (Supplementary Table S1). We found putative erroneous reads, which generally occur only once in a single le, and the number of reads with proportions ≥ 0.1% were not proportional to the number of reads in each le. Therefore, we determined the proportional threshold (≥ 0.1%) for the development of sequence-supervised processes and extracted 3,202 unique sequences (Supplementary Table S12). For METATEA users, we recommend that the threshold should be determined according to the number of reads in the le since putative erroneous reads might generally occur once in a single le in this study. Generally, threshold can be determined as But one may have to use greater proportions depending on further analyses since too big data often leads to errors when you use statistical tools.