1. Selection of Novel Candidate Housekeeping Genes Based on Transcriptome Sequencing Data
By analyzing RNA-Seq data, we acquired transcripts per million (TPM) values for all genes in Y. enterocolitica Y11. We computed both the log2(TPM) values and the maximum absolute deviation for each gene's log2(TPM). These values were then plotted in a scatter plot (Fig. 1). In this plot, genes positioned closer to the bottom-left corner are indicative of minimal expression variability due to temperature and inter-strain differences. To identify novel candidate housekeeping genes, we delineated a stringent criterion (marked by the red box in Fig. 1) and selected 10 genes that exhibited sufficient sequence conservation, relatively high expression levels, and the lowest coefficient of variation (CV) for their log2(TPM) values. The genes identified are: rpoD, thrS, ftsY, rlmL, rpoN, dnaX, glmS, lepA, argS, and glnS. To provide a clear visual comparison of the expression stability at 22°C and 37°C for the six housekeeping genes previously reported in the literature and the 10 newly identified candidate housekeeping genes, we constructed Fig. 2. The proximity between the red and blue circles represents the stability of gene expression across the two temperatures: the closer the circles, the more stable the gene expression at 22°C and 37°C.
2. Acquiring Cq Values of Housekeeping Genes in Clinical Strains Cultured at 22°C / 37°C
The subsequent step in our research was to simulate a realistic laboratory scenario by validating the expression stability of 16 genes identified in previous work at temperatures of 22°C and 37°C, using Y. enterocolitica strains. The primer sequences for these experiments are detailed in Table 2. These primers have been confirmed to effectively amplify the target genes, and their melting curves exhibit a single peak, indicating the specificity of amplification (data not shown). To lend more credibility to our experimental outcomes, we refrained from utilizing standard strains and instead employed clinical samples isolated from the feces of patients with diarrhea. Moreover, to reduce the impact of heterogeneity among different serotypes, we made a concerted effort to collect a comprehensive set of strains from four serotypes: O:3, O5:27, O:8, and O:9, securing three strains from each serotype. This resulted in a total of 12 bacterial strains for qRT-PCR validation.
After obtaining the Cq values for each strain and gene, we initially computed the sum of the ∆Cq values at 22°C and 37°C for each gene (Fidure 3). It is tentatively concluded that genes with smaller sums of ∆Cq values are more stable in their expression levels. When ranking the sum of ∆Cq values for each gene in ascending order, the four genes (top 25%) with the smallest values were rpoB, nuoB, rpoD, and glmS. However, when evaluating the stability within each of the four serotypes separately, a total of nine unique genes emerged. Among these, nuoB, rpoB, and rpoD were identified three times, glmS twice, and argS, dnaK, gyrB, lepA, and 16s rRNA each appeared once.
3. Analysis of gene expression stability by different statistical algorithms
Next, the expression stability of the 16 target candidate housekeeping genes was assessed by three dedicated analytical tools: geNorm[9], NormFinder[10], and BestKeeper[11]algorithms.
GeNorm is the most commonly used algorithm for assessing the stability of candidate housekeeping genes. It calculates a gene-stability measure value (M) by evaluating the average pairwise variation of each candidate gene in comparison to other genes. A lower M value indicates greater stability. After calculating M values for all genes, the two genes with the lowest scores are recommended as housekeeping genes. The experimental bacterial strains were categorized into four groups based on their serotypes. The M values for each gene within the three strains in each group were computed and arranged in ascending order according to their average values (Fig. 4). The top 25% genes with the smallest M values from each serotype group (four genes per group) were selected, resulting in a total of nine unique genes. Among them, glnS, rlmL, and thrS appeared three times each, rpoN appeard twice, and remaining five genes ftsY, glmS gyrB, lepA and nuoB appeared only once.
NormFinder is another popular algorithm for evaluating gene expression stability. It adopts a model-based approach to assess stability and calculates a stability value for each gene by considering both the overall variation and the variation attributed to specific sample groups. Similarly, a lower score from this algorithm indicates higher stability of the gene. The top 25% of genes with the smallest NormFinder scores were selected within each serotype group, resulting in 16 genes, out of which 10 were unique. Among them, glnS and gyrB appeared three times, glmS and ropN appeared twice, while the remaining six genes ftsY, lepA, nuoB, rlmL, rpoD and thrS appeared once.
The BestKeeper algorithm calculates the stability of candidate genes by analyzing their standard deviation (SD) of Cq values, as well as the coefficient of variance (CV), correlation coefficient (r), and p-value (p), which are also important parameters. It separately calculates the top 25% most stable genes in each serotype group, resulting in a total of 16 genes, out of which 8 are unique. Among them, rpoD is counted 4 times, ropB is counted 3 times, ftsY, glmS, and nuoB are each counted 2 times, and argS, gyrB, and lepA are each counted 1 time.
4. Comprehensive Ranking of Housekeeping Gene Stability Using the Robust Rank Aggregation Algorithm
It is evident that even for the same set of qRT-PCR data, different algorithms may yield rankings that are not entirely consistent. To obtain a composite ranking, we employed the RRA algorithm to integrate 48 ranking results from 12 clinical bacterial strains obtained through four different evaluation methods. This method led us to a comprehensive ranking (see Fig. 8). The results show that the composite rankings of the 16 candidate housekeeping genes can generally be divided into three tiers. The first tier includes glnS, nuoB, glmS, gyrB, dnaK, and thrS, which have the lowest RRA scores, indicating their consistently high rankings and good stability across various algorithms. glnS stands out in particular, with an RRA score of only 0.01, placing it in the foremost position. The second tier genes, rlmL, rpoD, and rpoB, display moderate overall ranking performance. The third tier comprises lepA, ftsY, 16s rRNA, rpoN, argS, gmk, and dnaX, which suggests that their rankings fluctuate significantly across different algorithms or they tend to rank lower more consistently.