## Illustration of Ring Repeating Unit

A polymer is a long-chain molecule that cannot be represented in full when developing QSPR models. Determining a proper representation of polymer structure is therefore of great importance and remains challenging. The monomer and the RU (or oligomer) are two traditional structure representations of linear polymers. Except for simple polymers with all-carbon backbones, such as vinyl and acrylic polymers, the existing methods can lead to non-unique structure representations for polymers with heteroatomic backbones. To illustrate this issue, PIs, which are products of polycondensation, are used as examples.

As shown in Fig. 2a, a linear polymer with an RU composed of four colored fragments can be prepared via different reaction paths (e.g., RP1…) using a variety of monomer combinations (e.g., M1, M1 + M2…). Because several monomer combinations are possible, using the “monomer” as the structure representation can yield different property values (e.g., \(P_{\text{RP}_1}^{\text{cal}}\)…) for the same polymer. As a proof of concept, the polymer PB3APIP-alt-PMA (PI-I), which has four distinguishable structural fragments, can be prepared via either RP4 using PMDA and MPD2-IP47 or RP2 using B3APTA and IPC48. The structure of the polymer of interest therefore cannot be uniquely described by the “monomer” method. The RU is another widely used structure representation of polymers, as shown in Fig. 2b. Four RUs can theoretically be assigned to the exemplified polymer, which leads to inconsistency in the property values (i.e., \(P_{\text{RU}_1}^{\text{cal}}\)…) predicted by a QSPR model. Taking the same polymer (i.e., PI-I) as an example, four distinct RUs are depicted, and any of them could be used for feature engineering. Therefore, structure representation by RU should also be ruled out.

To address this issue, a so-called RRU method is proposed to deterministically represent polymer structure, as shown in Fig. 2c. A unique ring fragment is formed by connecting the head and tail groups of any of the RUs (i.e., RU1 to RU4). Another advantage over the “RU” method is that the atoms of an RRU are fully connected, which allows the interaction between two adjacent RUs to be taken into account. The uniqueness of the structure representation enabled by the RRU is expected to give a single predicted property value (i.e., \(P_{\text{RRU}}^{\text{cal}}\)) from a QSPR model. This can be readily seen by comparing the RRU description of PI-I with those obtained by the “monomer” and “RU” methods.
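The uniqueness argument can be illustrated with a minimal sketch. It assumes, purely for illustration, that an RU is reduced to an ordered list of fragment labels (`A` to `D` are hypothetical stand-ins for the four colored fragments) and that closing the ring is equivalent to treating this list as cyclic. The function name `canonical_ring` is not from the original work, and a real implementation would operate on molecular graphs rather than labels.

```python
def canonical_ring(fragments):
    """Return a rotation-invariant key for a cyclic sequence of fragment labels.

    Joining head and tail makes every starting point equivalent, so the
    lexicographically smallest rotation serves as a canonical form.
    """
    n = len(fragments)
    rotations = [tuple(fragments[i:] + fragments[:i]) for i in range(n)]
    return min(rotations)


# Four hypothetical RUs of the same polymer: same fragments, different start.
ru1 = ["A", "B", "C", "D"]
ru2 = ["B", "C", "D", "A"]
ru3 = ["C", "D", "A", "B"]
ru4 = ["D", "A", "B", "C"]
all_rus = (ru1, ru2, ru3, ru4)

linear_keys = {tuple(ru) for ru in all_rus}        # open-chain view
ring_keys = {canonical_ring(ru) for ru in all_rus}  # ring-closed view

print(len(linear_keys), len(ring_keys))  # prints: 4 1
```

In this toy setting the four linear RUs are distinct inputs for feature engineering, whereas the ring-closed form collapses them into one representation.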

### Application of RRU in Representing Polymers Containing (A)symmetric Monomers

The RRU method was illustrated in the previous section using the linear polymer PI-I, which can be produced by polycondensation of two symmetric bifunctional monomers. However, the bifunctional monomers used in polycondensation to prepare linear polymers can be diverse. Depending on the symmetry of the monomer structures, the resulting polymers can differ in monomer sequence along the chain. Three scenarios can therefore be encountered in two-monomer polymerization systems, as shown in Fig. 3.

For the polycondensation of two symmetric bifunctional monomers, the simplest case, the resulting polymer is unique. Fig. 3a shows that polycondensation of 2,3-NDA and MPD-CF3 produces the polymer PFMPD-alt-NPA (PI-II)49 with a single monomer sequence (i.e., Mo-Seq1). As a result, the structure of PI-II can be deterministically represented by one corresponding RRU. Figure 3b shows polycondensation involving an asymmetric monomer and a symmetric monomer, the most common case, in which a polymer with two different monomer sequences can ideally be formed (i.e., Mo-Seq1 and Mo-Seq2). For example, polycondensation of the asymmetric monomer AHPA and the symmetric monomer 6FDA yields the polymer PAAP-alt-FPA (PI-III)50 with two distinct monomer sequences. Note, however, that the structure representation of PI-III is still unique under the RRU method. The last scenario, shown in Fig. 3c, is the most complicated but also a rare one. When the reaction mechanism is not definite, two asymmetric monomers can produce two polycondensation polymers, and there are thus four different monomer sequences. For example, polycondensation of two asymmetric monomers (i.e., CDPNDCA and APNA) produces two polymers, PDPDCPD (PI-IV1) and its isomer PI-IV251. Four distinguishable monomer sequences can ideally be identified, namely Mo-Seq1 and Mo-Seq2 for PI-IV1, and Mo-Seq3 and Mo-Seq4 for PI-IV2. Accordingly, representing the resulting polymers requires two corresponding RRUs.

In light of the above, one RRU is sufficient to represent one polycondensation polymer in each scenario, which enables a deterministic calculation of descriptors in feature engineering. For a complex system with mechanistically indistinguishable polymeric products, more than one RRU is needed to represent the resulting polymers. For such systems, for example the third scenario, an average descriptor is proposed for subsequent model development.
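A minimal sketch of the average-descriptor idea follows. The text does not specify the averaging scheme, so an unweighted arithmetic mean over the RRUs is assumed here; the function name `average_descriptors` and the numeric values are hypothetical.

```python
def average_descriptors(rru_descriptors):
    """Element-wise arithmetic mean of descriptor vectors, one vector per RRU.

    Assumption: all RRUs are weighted equally.
    """
    if not rru_descriptors:
        raise ValueError("at least one RRU descriptor vector is required")
    n_desc = len(rru_descriptors[0])
    if any(len(v) != n_desc for v in rru_descriptors):
        raise ValueError("all descriptor vectors must have the same length")
    n_rru = len(rru_descriptors)
    return [sum(v[k] for v in rru_descriptors) / n_rru for k in range(n_desc)]


# Hypothetical descriptor vectors for two RRUs of indistinguishable products.
rru_a = [1.0, 4.0, 10.0]
rru_b = [3.0, 6.0, 20.0]

avg = average_descriptors([rru_a, rru_b])
print(avg)  # prints: [2.0, 5.0, 15.0]
```

With a single RRU the function returns that RRU's descriptors unchanged, so the same code path would cover all three scenarios.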

### QSPR Model and Validation

On the basis of the proposed RRU method, an MLR model using 29 norm descriptors was developed to quantify the relationship between molecular structure and glass transition temperature for PIs. Bidirectional stepwise regression was used for dimensionality reduction. The resulting \(T_g\)-QSPR model for a broad data set of 1321 PIs is shown in Eq. (1). Definitions of the norm descriptors (\(I\)) and their corresponding coefficients (\(b\)) are listed in Supplementary Information Table 1. The statistical indices \(R_{\text{training}}^{2}\) and \(R_{\text{testing}}^{2}\) are both 0.8793, which is greater than the minimum standard of 0.6. In addition, the value of \(R_{\text{testing}}^{2}\) is consistent with that of \(R_{\text{training}}^{2}\), suggesting that the model has good stability and predictability.

$$\begin{aligned} T_g ={}& \sum_{i=1}^{9} b_i \times I_i + \frac{1}{n_A}\sum_{i=10}^{11} b_i \times I_i + \frac{1}{n_{nH}}\sum_{i=11}^{19} b_i \times I_i + \frac{1}{\max\left(MS\right)}\sum_{i=20}^{26} b_i \times I_i \\ &+ \frac{1}{\sqrt{\sum_{i}\sum_{j} MS}}\sum_{i=27}^{29} b_i \times I_i - 688.5554 \end{aligned} \tag{1}$$

\(n = 1321\); \(R^{2} = 0.8793\); \(Q_{\text{LOO}}^{2} = 0.8718\); AAE = 19.3836 °C;

\(n_{\text{training}} = 1057\); \(R_{\text{training}}^{2} = 0.8793\); \(\text{AAE}_{\text{training}} = 19.5722\) °C;

\(n_{\text{testing}} = 264\); \(R_{\text{testing}}^{2} = 0.8793\); \(\text{AAE}_{\text{testing}} = 18.6284\) °C.

where \(n_A\) is the number of atoms, \(n_{nH}\) is the number of non-hydrogen atoms, and \(MS\) is the step matrix defined by Eq. (2).
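The structure of Eq. (1), groups of weighted descriptors that are each divided by a structure-dependent normalizer, can be sketched as follows. This is only an illustration of the functional form: the function name `predict_tg`, the coefficients, the descriptor values, the intercept, and the toy matrix standing in for \(MS\) are all hypothetical, and the actual coefficients are those in Supplementary Information Table 1.

```python
import math


def predict_tg(groups, intercept):
    """Evaluate a grouped linear model of the same form as Eq. (1).

    groups: list of (scale, coefficients, descriptors) tuples; each group
    contributes (1 / scale) * sum(b_i * I_i) to the prediction.
    """
    total = intercept
    for scale, coeffs, descs in groups:
        if len(coeffs) != len(descs):
            raise ValueError("coefficients and descriptors must align")
        total += sum(b * x for b, x in zip(coeffs, descs)) / scale
    return total


# Hypothetical inputs (not taken from the paper).
n_a = 8    # number of atoms
n_nh = 4   # number of non-hydrogen atoms
ms = [     # toy symmetric matrix standing in for MS
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
ms_max = max(max(row) for row in ms)                 # largest entry of MS
ms_norm = math.sqrt(sum(sum(row) for row in ms))     # sqrt of sum of entries

groups = [
    (1.0,     [2.0], [3.0]),   # unnormalized group
    (n_a,     [4.0], [2.0]),   # normalized by number of atoms
    (n_nh,    [2.0], [6.0]),   # normalized by number of non-hydrogen atoms
    (ms_max,  [1.0], [5.0]),   # normalized by max(MS)
    (ms_norm, [8.0], [1.0]),   # normalized by sqrt(sum of MS entries)
]

tg = predict_tg(groups, intercept=10.0)
print(tg)  # prints: 24.5
```

Passing the groups explicitly keeps the sketch independent of how the 29 descriptors are partitioned among the five sums.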

Figure 4a plots calculated \(T_g\) against experimental \(T_g\) and shows that most of the data points lie close to the diagonal. The experimental and calculated values for the 1321 PIs are provided in Supplementary Data 1. The AAE values (i.e., \(\text{AAE}_{\text{training}} = 19.57\) °C and \(\text{AAE}_{\text{testing}} = 18.63\) °C) are acceptable and lie within the range of experimental measurement errors52. Taken together, these results confirm that the \(T_g\)-QSPR model based on the RRU method fits the data well. Notably, the two PIs specifically marked in Fig. 4a are polymers that each require two RRUs to represent their mechanistically indistinguishable polymeric products. Their RRU-based structure representations are depicted in Fig. 4b. The corresponding predicted \(T_g\) values are calculated from the average descriptors. The absolute errors (AE) between \(T_{g,\text{exp}}\) and \(T_{g,\text{cal}}\) are 5.53 °C and 9.04 °C, respectively, for these two PIs, which supports the use of the average descriptor for predicting the properties of polymers derived from two asymmetric monomers.
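For reference, the two goodness-of-fit statistics can be computed as in the sketch below. It assumes the common definitions \(R^2 = 1 - SS_{\text{res}}/SS_{\text{tot}}\) and AAE as the mean absolute deviation; the text does not state which exact formulas were used, and the function names and data are hypothetical.

```python
def r_squared(y_exp, y_cal):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_exp = sum(y_exp) / len(y_exp)
    ss_res = sum((e - c) ** 2 for e, c in zip(y_exp, y_cal))
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot


def average_absolute_error(y_exp, y_cal):
    """Mean of |experimental - calculated| over all samples."""
    return sum(abs(e - c) for e, c in zip(y_exp, y_cal)) / len(y_exp)


# Hypothetical experimental and calculated Tg values in degrees Celsius.
tg_exp = [200.0, 250.0, 300.0, 350.0]
tg_cal = [210.0, 240.0, 310.0, 340.0]

r2 = r_squared(tg_exp, tg_cal)
aae = average_absolute_error(tg_exp, tg_cal)
print(round(r2, 4), aae)
```

Applying the same two functions separately to the training and testing subsets would give the per-subset statistics reported alongside Eq. (1).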

To check the robustness of the model and rule out chance correlation, LOO-CV and \(Y\)-random validation were performed. The results of LOO-CV are presented in Fig. 4c and Fig. 4d. \(Q_{\text{LOO}}^{2}\) is 0.8718, which is greater than the minimum standard of 0.5, suggesting that the \(T_g\)-QSPR model is robust. The AE distribution of LOO-CV is broadly consistent with that of the model, and the AEs of \(T_g\) for the majority of the PIs (i.e., 798 PIs) are within 20 °C. Furthermore, the plot of \(R_Y^{2}\) versus \(Q_Y^{2}\) for 10,000 \(Y\)-random validations is shown in Fig. 4e. The average values of \(R_Y^{2}\) and \(Q_Y^{2}\) are 0.0219 and 0.0018, respectively, which are substantially smaller than the \(R_{\text{training}}^{2}\) (0.8793) and \(Q_{\text{LOO}}^{2}\) (0.8718) of the model. Chance correlation can therefore be excluded. In short, the results of external validation, internal validation, and \(Y\)-random validation indicate that the \(T_g\)-QSPR model has not only high prediction accuracy but also good stability and reliability.
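The two validation procedures can be sketched on a toy problem. The example below uses a one-descriptor least-squares line rather than the 29-descriptor MLR model, assumes \(Q_{\text{LOO}}^{2} = 1 - \text{PRESS}/SS_{\text{tot}}\) with the full-data mean in \(SS_{\text{tot}}\), and implements \(Y\)-randomization by shuffling the response values. All names and data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def q2_loo(x, y):
    """Leave-one-out Q^2: refit without sample k, then predict sample k."""
    n = len(y)
    my = sum(y) / n
    press = 0.0
    for k in range(n):
        slope, intercept = fit_line(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
        press += (y[k] - (slope * x[k] + intercept)) ** 2
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - press / ss_tot


def mean_q2_y_random(x, y, n_rounds, seed=0):
    """Average Q^2 over models fitted to randomly shuffled responses."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scores.append(q2_loo(x, shuffled))
    return sum(scores) / len(scores)


# Hypothetical data: one descriptor with a nearly linear response.
x = [float(i) for i in range(20)]
y = [200.0 + 3.0 * i + ((7 * i) % 5 - 2) for i in range(20)]

q2_model = q2_loo(x, y)
q2_random = mean_q2_y_random(x, y, n_rounds=200)
print(round(q2_model, 4), round(q2_random, 4))
```

A model that captures a real structure-property relationship should score far higher than its shuffled counterparts, which is the comparison drawn between the reported model statistics and the \(Y\)-random averages.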

To further assess the predictive performance of the model, it was compared with published reports in terms of structure representation method, polymer type, and correlation coefficient, as summarized in Table 1. Three commonly used polymer structure representations are included. The “monomer” representation exhibits the worst predictive performance. Compared with other works, the data set used in this study is broad (> 1000 polymers) and the prediction accuracy is good (\(R^{2} = 0.8793\)), comparable to the results obtained with RU-based descriptors. Although the oligomer representation gives the highest \(R^{2}\) of 0.9530, this can be attributed to the very limited data set of 88 polymers. It should be stressed that a problematic structure representation of polymers with heteroatomic backbones causes inconsistency in the predicted property values, and this inconsistency is usually hidden by the high correlation coefficient of the prediction model. The unique structure representation enabled by the RRU gives a single predicted property value, which is crucial for *a priori* prediction of polymer properties and for polymer design.

Table 1. Comparison with other works for predicting \(T_g\).

| Representation | Ref. | Polymer | Method | Dataset (training set/testing set) | \(R^{2}\) (\(R_{\text{training}}^{2}\)/\(R_{\text{testing}}^{2}\)) |
|---|---|---|---|---|---|
| Monomer | Khan et al.29 | Polyacrylates, polystyrenes, polyvinyls, etc. | PLS | 206 (-/-) | 0.7590 (-/-) |
| Monomer | Wen et al.30 | Polyimides | LASSO | 290 (-/-) | 0.6889 (-/-) |
| Monomer | Karuth et al.34 | Polyesters, polyvinyls, polyethers, etc. | GA-MLR | 100 (80/20) | - (0.8400/0.5100) |
| Repeating Unit | Ramprasad et al.23 | Polyethylene, polyesters, polyureas, etc. | GPR | 451 (360/91) | - (0.9200/0.9000) |
| Repeating Unit | Wu et al.37 | Polyimides, polyesters, polyoxides, etc. | BO & KN | 5917 (-/-) | 0.8391 (-/-) |
| Oligomer | Díaz et al.39 | Polyethylenes, polyacrylates, polymethacrylates, etc. | MLP-NN | 88 (-/-) | 0.9530 (-/-) |
| Oligomer | Luo et al.40 | Polyesters, polyacrylates, polyoxides, etc. | SVM | 1034 (-/-) | 0.8650 (-/-) |
| Ring Repeating Unit | This work | Polyimides | MLR | 1321 (1057/264) | 0.8793 (0.8793/0.8793) |

Notes: PLS: partial least squares regression; LASSO: least absolute shrinkage and selection operator; GA-MLR: genetic algorithm-multiple linear regression; GPR: Gaussian process regression; BO & KN: the back-off and the Kneser-Ney smoothing methods; MLP-NN: multi-layer perceptron neural network; SVM: support vector machine; MLR: multiple linear regression.