Detrended Fluctuation Analysis and Discrete Wavelet Transform for Similarity Comparison of RNA Secondary Structure

Background: Ribonucleic acid (RNA) is an important biological macromolecule. Through in-depth studies of RNA, its function has been increasingly discovered. The function of RNA is mostly dependent on its secondary structure because of its conserved nature. The discovery of an approximate relationship between two RNA secondary structures can help to understand their functional relationship better. This discovery can also help in exploring many unknown functions. Currently, RNA secondary structural similarity analysis methods are mainly divided into alignment-based methods and alignment-free methods. Alignment-free methods can obtain similarities and diﬀerences among RNA secondary structures more quickly and more accurately than alignment-based methods. Results: In this paper, a novel alignment-free method based on the triple vector curve representation of RNA is proposed. A combinational method involving a discrete wavelet transform and detrended ﬂuctuation analysis (DFA) with a sliding window is used to generate the distance matrix. Finally, a phylogenetic tree is constructed using the distance matrix. Experiments are performed on RNA viruses and non-coding RNA datasets, and the phylogenetic trees generated by diﬀerent methods are compared. The results show that our method yields more accurate results in the comparison of RNA secondary structures. Conclusion: The method in this paper enables a more accurate analysis of the similarities between RNA secondary structures. This method has certain application value in the comparison of RNA secondary structures, especially in the analysis of longer RNA sequences.


Background
The secondary structure of RNA is a double-stranded structure constructed according to the principle of complementary base pairing. In RNA, U (uracil) is complementary to A (adenine), and G (guanine) is complementary to C (cytosine). RNA is usually transcribed from DNA and acts as a bridge between DNA and proteins [1]. The versatility of an RNA molecule depends on its secondary structure. RNAs with similar structures tend to have similar functions or properties, but the opposite does not necessarily hold true. Many RNA molecules are conserved at the structural level, but they have little sequence similarity. Therefore, the comparison of RNA secondary structures is key to elucidating their functional and evolutionary relationships. Most recent studies have focused on RNA secondary structure prediction [2][3][4], and comparisons of RNA secondary structures have not yet been sufficiently studied. At the present stage, the comparison methods for RNA secondary structures are mainly divided into two types: alignment-based methods and alignment-free methods. Alignment-based methods mainly rely on an RNA secondary structure represented by a string or tree [5][6][7][8][9][10]. The Sankoff algorithm uses the free energy minimization method to synchronize the folding and alignment of two or more RNA sequences. However, it is difficult to achieve. Therefore, to be more effective in RNA secondary structure alignment, a number of improved algorithms were subsequently developed , such as Consan and, Dynalign [11], PMcomp [12], Stemloc [13], Foldalign [14], locARNA [15], SPARSE [16], MARNA [17], FoldAlignM [18], Murlet [19], CARNA [20] and RAF [21]. In response to the unsolved problems of the Sankoff algorithm, Consan developed a good unified scoring algorithm, a pair-SCFG structural alignment algorithm, for paired alignment and folding. Dynalign obtained a common structure with low free energy by comparing two sequences that lack sequence identity and free energy minimization. PMcomp uses McCaskill s method to calculate the basic paired probability matrix from which the paired and progressive multiple alignments are calculated. Due to the nature of the locARNA method being limited to high complexity, SPARSE introduced the algorithm "sparsified prediction and alignment of RNAs based on their structure ensembles (SPARSE)", which has lower time complexity and improved accuracy. Alignment-free methods divide the process into two parts: folding and alignment. Among them, some tree-based methods, such as RNAforester [22], RNAdistance [8], RNAStrAt [23] and RNApdist [24], are widely used. The RNAforester tree comparison algorithm computes the local similarity in RNA secondary structures. RNAdistance calculates the similarity of the RNA secondary structures by measuring the editing distance of the tree. RNApdist compares the RNA secondary structures based on base-pairing probabilities. The Vienna RNA package can now be used to implement both the RNAdistance and RNApdist methods [25,26]. Notably, these comparison algorithms have high time complexity. General alignment-free methods are usually based on the numerical representations of RNA secondary structures, followed by the development of the many graphical representations of RNA secondary structures. The visualization of an RNA secondary structure using a graphical representation is more intuitive, thus providing a new way of thinking about the comparison of RNA secondary structures. Previously, researchers used eight symbols to represent RNA, but this representation was accompanied by a loss of information [27]. A feature sequence may correspond to diverse RNA secondary structures. That is, the feature sequences obtained by such a method are not unique. To reduce the loss of information, based on a 4×4 matrix of RNA sequences proposed by Randic [28], Yao used four horizontal lines and eight symbols to represent the RNA secondary structure [29]. The method avoids the loss of information due to the crossover and overlap of curves, but the information loss caused by the limitations of feature invariant extraction still exists. Then, a method for RNA secondary structure visualization with no information loss was proposed by Randic [30]. In [31], Li proposed a novel representation of RNA secondary structures (TV-Curve) and introduced a wavelet decomposition to compare RNA secondary structures.
Wavelet analysis is a synthesis of ideas in pure mathematics, physics and engineering. A commonly used windowed Fourier transform, such as the short-time Fourier transform (STFT), analyzes the signal with a fixed sliding window. Obviously, fixed-length sliding window processing is not suitable for all signals. The linear time-frequency analysis for non-stationary signals should have different resolutions at different positions in the time-frequency plain; in other words, it should be a multi-resolution analysis method. Wavelet transform is a multi-resolution analysis method. Furthermore, wavelets are a fairly simple and effective mathematical tool that have been widely used. [32] combined a wavelet transform with a neural network to achieve highly accurate machine fault diagnosis. In [33], wavelet analysis was used to identify membrane protein types. It was also used to realize RNA secondary structure similarity analysis. In practical applications, especially when implementing the wavelet transform on a computer, the signal must be discretized and analyzed by the discrete wavelet transform. The discrete wavelet transform was used for the feature extraction of skin lesions to obtain the best combination [width=7cm] Fig.1(a).eps of features [34]. By detecting and analyzing protein secondary structures by discrete wavelet transforms, the correlation of the amino acid sequence and secondary structure was tested by the hydrophobic value of the amino acids [35]. The fractal dimension (FD) reflects the validity of the space occupied by a complex form, which is a measure of the irregularity of a complex form [36]. FD acknowledges that various parts of the world may be similar in some way to the whole region under certain conditions or processes, and it recognizes that changes in spatial dimensions can be discrete and continuous [37]. [38] combines empirical modal decomposition and multi-fractal detrended fluctuation analysis to study the fractal characteristics of harmonic signals. To address the multi-fractal characteristics of hydrographic data, Zhao established a wavelet method and a multi-fractal detrended fluctuation analysis model to study the multi-fractal characterization and simulation of river levels [39]. Additionally, the fractal dimension has been introduced in the studies of biological molecules. Yu exploited the hydrophobicity of amino acids and fractal analysis to classify the protein structure [40]. In [41], Yang applied the fractal dimension to protein sequence similarity analysis and obtained a high degree of similarity, which has also been recently used to determine the distribution of purines and pyrimidines in the miRNAs of humans, gorillas, chimpanzees, mice and rats [42].
In this paper, we propose a novel combined algorithm. Based on the characteristics of DWT multi-resolution analysis and the universality of FD, DWT is used to analyze the feature vectors of RNA secondary structures, and FD is used to quantitatively characterize their geometric structure features. The purpose of this paper is to explore the application of DWT and fractal dimension in the comparison of RNA secondary structures.
The paper is organized as follows. We outline the basic concepts and theory of the fractal dimensions in Section 2. Our algorithm is presented in Section 3. Numerical experiments and the results are then given in Section 4. Section 5 is the conclusion.

Results
In this section, eight RNA viruses with slight differences in structure are selected to judge the application of our method to sequences with small structural differences. Their full gene sequences are used here, and their secondary structures are predicted by the Vienna RNA package. The eight viruses include the alfalfa mosaic virus (ALMV), apple mosaic virus (APMV), citrus leaf rugose virus (CiLRV), citrus variegation virus (CVV), elm mottle virus (EMV), lilac ring mottle virus (LRMV), prune dwarf ilarvirus (PDV) and tobacco streak virus (TSV) [43]. Information on these eight viruses is presented in Table 1. Fig. 1 shows the DWT and DFA on the TV-Curve of the EMV virus sequence using our method. Constructing the experimental environment by setting the wavelet level to 3 and the window width to 17, we obtain the distance matrix of eight viruses generated by detrended fluctuation analysis in Table 2. Next, as shown in Fig RNAdistance, MEGA software [44] and Li s method. The phylogenetic tree generated by the detrended fluctuation analysis method is based on the distance matrix  In addition, an experiment containing 11 non-coding RNAs is performed to test the applicability of our method in comparing the similarities between noncoding RNAs. Eight sequences are randomly selected from the ncRNAs in the RFAM database; additionally, three human ncRNA sequences from the NONCODE database are randomly sampled for the experiment. The information of these ncR-NAs is provided in Table 3, and the distance matrix generated by the experiment is shown in Table 4. The results are shown in Fig. 3.

Discussion
A detailed explanation of the phylogenetic trees is given below; it is apparent from Fig. 2 that the phylogenetic tree shown in Fig. 2(a) is more similar to the standard phylogenetic tree, shown in Fig. 2(d), generated by MEGA software. First, PDV and APMV have a high similarity, which is also clearly reflected in Table 2. However, PDV and APMV in Fig. 2(b) and Fig. 2(c) are not the most similar of all sequences, and they deviate from the standard phylogenetic tree generated by MEGA software. In addition, CVV is also very similar to PDV and APMV, but CiLRV is not similar to them; hence, it is not reasonable to place CiLRV close to them in Fig. 2(b). Finally, Fig. 2(c) does not reveal a higher similarity between CiLRV and ALMV.
In conclusion, compared with other methods, our proposed method can obtain the similarities of RNA secondary structures more accurately.
Furthermore, the phylogenetic trees were compared using the classical Robinson-Foulds (RF) metric [45] to calculate the Pearson correlation efficiency between the tip-to-tip distance matrices of the two trees. A and B are two n × n matrices generated from the two phylogenetic trees being compared. The Pearson correlation [width=1.5in] Fig.3(a).eps (a) [width=1.5in] Fig.3 Fig.3(c).eps (c) [width=1.5in] Fig.3 coefficient between the two matrices is calculated as follows: where A is the mean of matrix A and B is the mean of matrix B. The Pearson correlation efficiencies between the evolutionary trees generated by the different methods and those generated by MEGA software are shown in Table  5. The Pearson correlation coefficient between the evolutionary tree generated by Li s method and the evolutionary tree generated by MEGA software is 0.4542, the Pearson correlation efficient between the evolutionary tree generated by the RNAdistance method and the evolutionary tree generated by MEGA software is 0.6334, and the Pearson correlation efficient between the evolutionary tree generated by the DFA method and the evolutionary tree generated by MEGA software is 0.6345. Obviously, our approach is closer to the evolutionary tree generated by MEGA software than the other approaches. In the second experiment, as indicated in Fig. 3, the experimental sequences are accurately divided into three families in Fig. 3(d), tRNA, RNase P RNA, and a group of human ncRNA in the NONCODE database. Fig. 3(d) shows that the N T 033777.3 sequence has a high similarity to the tRNA family, which is also reflected in Fig. 6(a) and Table 4. In Fig. 3(b), N T 033777.3 and M F 489813.1 are classified into a group, and N T 033777.3 is the sequence that is the least similar to the others in the tRNA family, which is unreasonable. N T 033777.3 was mistakenly placed in a group close to the NONCODE human non-coding RNA in Fig. 3(c). Based on the above analysis, the method based on DFA and DWT is an effective algorithm for RNA secondary structure similarity analysis. The Pearson correlation efficiencies between the evolutionary trees generated by the DFA method, RNAdistance, Li s method and the evolutionary tree generated by MEGA software are shown in Table 6. Evidently, the results obtained by our method in this experiment are more accurate.

Conclusions
This paper proposes a hybrid method for the similarity comparison of RNA secondary structures. The algorithm is based on the existing RNA triple vector representation, uses DWT to process the feature sequences, and captures the fractal characteristics using the DFA method. Compared with several commonly used RNA comparison methods, the approximate relationships between the RNA secondary structures obtained by the DFA and wavelet transform method are close to the actual relationships. However, the secondary structures predicted by the minimum free energy in the Vienna RNA package is not optimal, and finding the optimal secondary structure for RNA rapidly and efficiently remains a challenging problem.
In addition, the method is not yet excellent for analyzing shorter RNAs, and systematic studies will be carried out on RNAs of different lengths and characteristics. Figure 4 The flowchart of our method.

Methods
In this paper, a combination of the discrete wavelet transform, detrended fluctuation analysis and sliding window is proposed to evaluate the similarity of RNA secondary structures based on the TV-Curve of RNA. The TV-Curve of RNA is used to construct the feature vector, which is analyzed by the discrete wavelet transform, and then the sliding window is introduced. For a signal of fixed length within the window, the fractal dimension is calculated using the detrended fluctuation analysis (DFA) method to obtain the distance matrix. Finally, the corresponding phylogenetic tree is obtained. The algorithmic process is shown in Fig. 4.

Basic concepts of fractal dimension Theoretical Fractal Dimension
The German mathematician F. Hausdorff studied the properties and quantities of singular aggregates and proposed the concept of fractal dimension called the Hausdorff dimension in 1918 [46]. It is one of the most important fractal dimensions. The Hausdorff dimension can be used for any set. The theory of the Hausdorff dimension is as follows: Suppose a non-empty subset A of an n-dimensional Euclidean space R n , where the diameter of A is the maximum distance of any two points in A, then: Assume C ⊂ R n and 0 ≤ s ≤ ∞, for any δ > 0 , The above equation refers to the coverage of C whose diameter does not exceed δ, and the sum of these diameters is minimized to the power of s. The symbol inf {·} takes the minimum value of {·}. As δ decreases, the clusters covering C in equation (3) also decrease, and the infimum H s δ (S) increases accordingly. Moreover, when δ → 0, it tends to a limit, denoted as For any subset C in R n , the limit exists, and the limiting value is commonly 0 or ∞, and H s (C) is called the s − dimensional Hausdorf f measure of C. In equation (3), for a given set C ⊂ R n and δ < 1, H s δ (C) does not increase with respect to s, so it can be shown from equation (4) that H s (C) does not increase either. Further, if t > s, and {A i } is the δ-cover of C, then Take the infimum, In In addition, if s = dim H C, H s (C) satisfies,

Algorithms of Fractal Dimension Calculation
Although the fractal dimension is widely defined, it is subject to some limitations in practical applications. Due to its high computational complexity and the discrete and finite nature of the scale factor, we usually use a number of algorithms to approximate the fractal dimension. To date, many methods have been developed to compute fractal dimensions, such as Katz s algorithm [47], Petrosian s algorithm [48], Higuchi s algorithm [49], multifractal detrended fluctuation analysis (MFDFA) [50], detrended fluctuation analysis (DFA) [51] and fluctuation analysis (FA) [52]. In terms of processing time, Katz s algorithm and the DFA method are relatively slow. However, in terms of processing accuracy, Higuchi s algorithm, detrended fluctuation analysis (DFA) and fluctuation analysis (FA) are relatively accurate. In the following, we will introduce Higuchi s algorithm, fluctuation analysis (FA) and detrended fluctuation analysis (DFA).

Higuchi s Algorithm
Treating an RNA sequence as a one-dimensional signal s 1 , s 2 , ..., s N , construct a subsequence as For m = 1, ..., k, where N is the length of the signal, and (N −1) ( N −m k k) is the normalization term. With the given measurement scale, the approximate length of the signal with starting position m is computed, m = 1...k.
Thus, the approximated signal length can be obtained by: Finally, the fractal dimension h * of the signal is calculated by the least squares method [53]: where c is the bias.

Fluctuation Analysis (FA) method
The fluctuation analysis (FA) method is commonly used to calculate the Hurst parameter for time series. It works as follows. Suppose X = {X i , i = 1, 2, ..., N } is a random process with a mean value of µ and a variance of σ 2 . First, we remove the mean from the signal and represent the new signal as Then, construct a new one-dimensional signal y = {y n , n = 1, 2, ..., N } with the sum of the first n terms of x, Test whether the following formula satisfies the power law formula, That is, whether log 2 F (2) (m) and log 2 (m) are satisfied: where c is a constant. From the above equation, we can obtain the Hurst parameter H.
In the end, we can obtain FD by the following formula: Detrended fluctuation analysis (DFA) method Owing to the complicated characterization of long-range correlations and power-law statistics that RNA TV-Curves have, it follows that comprehensively describing and studying the internal features of non-stationary signals such as RNA TV-Curves by traditional methods is difficult. For complex, highly non-stationary signals, time series analysis methods derived from statistical physics are currently used. The DFA method can effectively eliminate various unknown trends in the signal, thus avoiding the interference caused by noise and signal instability. The DFA method has also been widely used in image processing and other fields [54]. The DFA method has been proven to be an effective method to analyze non-stationary signals. In summary, as the FA, the signal x r , r = 1, ..., N is preliminarily processed, and the mean value x is removed. The sum of the first term i of x r as is the i of a new sequence: Next, the signal Y (i) is equally split into data boxes of length k. k, which is the measurement scale, is the number of points per data box.
In each box of length k, a least-squares line is fit to the data. As a result, fitting data box y k (j) is obtained. Then, we detrend the signal Y , In the end, the different data obtained for each box are integrated and averaged, where, F 2 DF A (k) = 1 k k j=1 Y k 2 (j). The scaling exponent α is the slope of the line of fit of F 2 (k) and k. The fractal dimension is calculated by: Graphical characterization of RNA secondary structure Dot-bracket representation The sequence of RNA consists of the bases C, G, U, and A. The primary sequence does not contain structural information. During the exploration of the RNA function, two bases combine with each other to form base pairs. Typically, the base pairs are A-U, G-C. Thus, the secondary structure of RNA is formed. Currently, 'the dot-bracket representation is commonly used to signify the secondary structure of The feature sequence is read from 50-terminal to 30-terminal, and the vectors are sequentially connected to obtain the TV-Curves. For example, Fig. 6 demonstrates the secondary structure of EMV and the corresponding TV-Curve. An RNA sequence of length N generates a TV-Curve with an X-axis length of 3N. An RNA sequence and its secondary structure may produce only one TV-Curve, and similarly, a TV-Curve may represent only one RNA secondary structure. Moreover, the TV-Curve has no information missing from the process of feature extraction of the RNA.
Comparison of RNA secondary structures based on the RNA TV-Curve by detrended fluctuation analysis Discrete Wavelet Transform The wavelet transform decomposes data, functions or operators into components of different frequencies, and then studies each component on the corresponding scale [55,56]. We now study the wavelet transform within the range of signal analysis. Similar to the discrete Fourier transform, the discrete wavelet transform is also a time-frequency description method. The signal is decomposed into two characters: the approximation coefficients (AC) and the detailed coefficients (DC). For the signal x(t), we use two functions at the same time, wavelet function Ψ(x) and scaling function Φ(x). Φ(x) for the s(t) overview approximation and Ψ(x) details the approximate function of x(t). Φ(x) can be formulated as: where h n is the low-pass filter. The wavelet function Ψ(x) can be calculated by: Two sets of orthogonal functions can be extracted from the wavelet transform: shifted wavelet functions Ψ (x − k) and scaling functions Φ (x − k). For any signal x = a j k , the approximation coefficients (AC) and detailed coefficients (DC) in the next level can be quickly calculated as follows: a j+1 k = i∈Z a j i h i−2k , j = 0, 1, 2, ...
The decomposition is taken level by level. The lengths of AC and DC obtained after the progressive decomposition are always half the length of the level sequence antecedently. The processing of RNA secondary structures using DFA allows for the description of the overall trends and general characteristics of the signal, as well as the local details.
Windowed detrended fluctuation analysis (DFA) method If the fractal dimension is calculated by the DFA of the whole signal, only one scalar can be obtained. The sliding window only calculates the signal values in a fixed length window [57]. We introduce a sliding window that moves along the signal and calculates the fractal dimension within the window. Calculating the fractal dimensions along a set sliding window yields a feature vector, not just a number. The length of this feature vector is N − k − 1, where N is the signal length and k is the window width. m indicates the starting position of the signal. A least-squares line is fit to the data in each window of length k. As a result, a fitting data box y m k (j) is obtained. Then, we detrend signal Y : For each window, calculate the following formula, Next, the average is calculated, From the above, the lb − lb graph of F 2 (k) − k is drawn for different values of k, it is fit to a straight line, and the scaling exponent α is obtained. The fractal dimension is calculated by:

Declarations
The manuscript has not been published before and is not being considered for publication elsewhere. All authors have contributed important intellectual content for the creation of this manuscript and have read and approved the final manuscript. We declare that there are no conflicts of interest.
Ethics approval and consent to participate Not applicable

Consent for publication Not applicable
Availability of data and materials The datasets analyzed during the current study are available from the RFAM and NONCODE databases, http://rfam.xfam.org/, http://www.noncode.org/.