Population selection
This retrospective study analyzed a total of 14043 pregnant women who received prenatal care at Shenzhen Longgang District Women and Children Healthcare Hospital from November 2017 to April 2019. The inclusion criteria were: (1) NIPS performed between 12 and 24 weeks of gestation; (2) first-trimester or second-trimester serum screening, together with at least one of the following clinical laboratory tests: complete blood count (CBC) or thyroxine test; and (3) singleton pregnancy. Each participant gave written informed consent for laboratory and clinical information to be used in anonymous, non-profit scientific studies. The study was approved by the ethics committee of Shenzhen Longgang District Women and Children Healthcare Hospital before implementation.
NIPS And Fetal Fraction Calculation
For each participant, a 5 ml maternal whole blood sample was collected in an EDTA-K2 tube (BD, UK). Cell-free maternal plasma was obtained by two sequential centrifugation steps: whole blood was centrifuged at 1600 g (10 min, 4 °C), and the resulting plasma was centrifuged at 13000 g (10 min, 4 °C). DNA extraction and library construction were performed for each sample following the instructions of the NIFTY chromosomal abnormality test kit (BGI, Wuhan, China)[10]. Libraries were quantified using Qubit 3.0 (Thermo, USA). Forty-eight libraries were pooled into one mixed library, which was single-end sequenced on the BGI SEQ500 sequencing platform (BGI, Wuhan, China) with a read length of 35 bp and an average of 9.7M reads per sample.
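As a rough illustration of what these sequencing parameters imply, the following Python sketch works through the arithmetic for per-sample yield and genome-wide depth. The genome size of about 3.1 Gb is an assumption introduced here for the estimate and is not stated in the text; the variable names are illustrative only.

```python
# Back-of-the-envelope sequencing yield from the figures quoted above.
READ_LENGTH_BP = 35              # read length reported in the text
MEAN_READS_PER_SAMPLE = 9.7e6    # average reads per sample reported in the text
LIBRARIES_PER_POOL = 48          # libraries pooled per sequencing run

# ASSUMPTION: approximate size of the human reference genome (not from the text).
ASSUMED_GENOME_SIZE_BP = 3.1e9

# Total sequenced bases per sample (about 339.5 Mb).
bases_per_sample = READ_LENGTH_BP * MEAN_READS_PER_SAMPLE

# Approximate average depth of coverage per sample.
depth = bases_per_sample / ASSUMED_GENOME_SIZE_BP

# Approximate number of reads needed per pooled run.
reads_per_pool = MEAN_READS_PER_SAMPLE * LIBRARIES_PER_POOL

print(f"bases per sample: {bases_per_sample / 1e6:.1f} Mb")
print(f"approximate depth: {depth:.2f}x")
print(f"reads per pool:   {reads_per_pool / 1e6:.1f} M")
```

Under these assumptions the depth is on the order of 0.1x, which is why downstream analysis relies on read counts in bins rather than per-base coverage.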
After read alignment to the reference genome hg19 (bwa-0.7.11)[25], removal of PCR duplicates (SAMtools-1.9)[26], read counting in 30 kb bins (bedtools-2.18.0)[27] and GC correction based on LOWESS regression, fetal fraction (FF) was calculated using the chromosome Y (chrY) method and SeqFF[28, 29]. First, the FF of all male fetal samples was obtained by the chrY method. Then a SeqFF model, comprising an elastic net regression and the weighted rank selection criterion (WRSC), was fitted using the male samples. Finally, the FF of both male and female samples was calculated using the SeqFF model, and this value was used as the final FF for the subsequent prediction. A standalone chrY fraction reference and SeqFF model were established in our lab to fit the BGI sequencing platform. In this study, relative low FF status was defined as an FF lower than the 10th percentile of FF across all samples. Relative low FF status was treated as a binary variable, coded as 1 when present and 0 when absent.
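The labeling rule at the end of the paragraph above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the FF values are invented, and the percentile interpolation rule (`method="inclusive"`) is an assumption, since the text does not specify one.

```python
import statistics

def label_relative_low_ff(ff_values):
    """Return (threshold, labels): label 1 if FF is below the 10th percentile, else 0."""
    # quantiles(n=10) returns the nine decile cut points; the first is the 10th percentile.
    # ASSUMPTION: "inclusive" linear interpolation; the source does not name a method.
    threshold = statistics.quantiles(ff_values, n=10, method="inclusive")[0]
    labels = [1 if ff < threshold else 0 for ff in ff_values]
    return threshold, labels

# Invented FF values in percent, for illustration only.
ff_percent = [4.0, 6.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 20.0]
threshold, labels = label_relative_low_ff(ff_percent)
print(threshold, labels)
```

Because the cutoff is a percentile of the cohort itself, roughly one sample in ten is labeled as relative low FF by construction, regardless of the absolute FF values.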
Association Between FF And Clinical Variables
To construct a machine learning model predicting the presence of low fetal fraction, we collected the clinical and individual information listed in Table 1 for each participant. Missing (NA) values were accepted to enhance the robustness of the model. Maternal weight in the first trimester and in the second trimester were included as separate variables instead of BMI because maternal height was not available in this dataset.
For each variable, the Pearson correlation coefficient with FF and the area under the ROC curve (AUC) for predicting low FF status were calculated to explore its impact on FF. To investigate the risk factors for low FF status, odds ratios (ORs) and 95% confidence intervals (CIs) adjusted for maternal age and gestational age were calculated for each variable. In addition, for each variable, all specimens were divided into a higher group and a lower group, and the Mann-Whitney U test was performed to assess whether FF differed significantly between the two groups.
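The two per-variable summaries described above, the Pearson correlation with FF and the AUC for predicting low FF status, can be sketched in plain Python as below. This is an illustrative sketch with invented data, not the code used in the study; the AUC is computed through its rank-based (Mann-Whitney) interpretation, and the function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def roc_auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one.

    Ties count as one half, matching the Mann-Whitney U formulation.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Invented example: maternal weight (kg), FF (%), and low FF status (1 = low).
weight = [50, 55, 60, 65, 70, 80]
ff = [14, 13, 11, 10, 8, 5]
low_ff = [0, 0, 0, 0, 0, 1]

r = pearson_r(weight, ff)
auc = roc_auc(weight, low_ff)
print(f"Pearson r = {r:.3f}, AUC = {auc:.2f}")
```

In this toy data the heaviest participant is the only low FF case, so weight separates the two classes perfectly and the correlation with FF is strongly negative.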
Random Forest Model And K-fold Cross Validation
Data preprocessing was performed before fitting the prediction model. First, clinical laboratory test results and other information were normalized using the z-score method after outliers had been replaced with NA values. Second, NA values were imputed using the kNN method. Finally, the dataset was split randomly into 10 folds of equal size. In each iteration, 7 folds were selected as the training set, containing 70% of the specimens, and the remaining 3 folds were used as the test set, containing 30% of the specimens[30].
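Two of the preprocessing steps above, z-score normalization in the presence of missing values and the 7:3 fold-based split, can be sketched as follows. This is a minimal Python illustration under stated assumptions: missing values are represented as `None`, the folds used for training are chosen at random in each iteration (the text does not say how they were chosen), and the seeds and function names are hypothetical. The kNN imputation step is not reproduced here.

```python
import random
import statistics

def zscore(values):
    """Z-score a column, leaving missing entries (None) untouched."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # sample standard deviation
    return [None if v is None else (v - mean) / sd for v in values]

def make_folds(n_samples, k=10, seed=0):
    """Randomly assign sample indices to k folds of (near-)equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def split_by_folds(folds, n_train_folds=7, seed=0):
    """Pick n_train_folds folds for training and the rest for testing.

    ASSUMPTION: the training folds are drawn at random for each iteration.
    """
    order = list(range(len(folds)))
    random.Random(seed).shuffle(order)
    train = [i for f in order[:n_train_folds] for i in folds[f]]
    test = [i for f in order[n_train_folds:] for i in folds[f]]
    return train, test

z = zscore([50.0, None, 60.0, 70.0])
folds = make_folds(100)
train_idx, test_idx = split_by_folds(folds)
print(z, len(train_idx), len(test_idx))
```

Changing the seed passed to `split_by_folds` yields a different 7:3 partition of the same folds, which is one way to repeat the evaluation over several iterations.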
We established a supervised regression model using the R package randomForest 4.6–14, which implements the random forests algorithm introduced by Breiman et al.[31]. All clinical test results and information in the training set were used as independent variables, and relative low FF status was used as the response variable. The fitted model was then validated on the test set, where it predicted relative low FF status as a value in the range [0,1]. Samples with the predicted value lower than 10th percentile were marked as predictive relative low FF status.
Overall accuracy, ROC curves and precision-recall (PR) curves were selected as the performance evaluation metrics. To compare this comprehensive predictive method with a single-variable predictive method, we also defined another prediction method based on maternal weight in the first and second trimester: samples with maternal weight higher than the 90th percentile were marked as predictive relative low FF status. We then compared the effectiveness of the predictive model in classifying low and normal FF status against that of maternal weight in the first and second trimester.
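The weight-based comparison rule and the basic evaluation metrics can be sketched as below. This is an illustrative Python sketch with invented weights and labels, not the study's R code; the percentile interpolation rule is an assumption, and the function names are hypothetical.

```python
import statistics

def flag_above_percentile(values, pct=90):
    """Mark samples whose value exceeds the given percentile of the cohort."""
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    # ASSUMPTION: "inclusive" linear interpolation; the source does not name a method.
    cutoff = statistics.quantiles(values, n=100, method="inclusive")[pct - 1]
    return [1 if v > cutoff else 0 for v in values]

def classification_metrics(predicted, truth):
    """Overall accuracy, precision and recall for binary labels (1 = low FF)."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 0)
    accuracy = (tp + tn) / len(truth)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Invented maternal weights (kg) and true relative low FF status.
weights = [48, 50, 52, 55, 57, 60, 62, 65, 70, 75, 90]
true_low_ff = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

predicted_low_ff = flag_above_percentile(weights, pct=90)
accuracy, precision, recall = classification_metrics(predicted_low_ff, true_low_ff)
print(predicted_low_ff, accuracy, precision, recall)
```

A single fixed cutoff such as this yields one point on the ROC and PR curves; sweeping the cutoff over all observed values traces out the full curves used for comparison.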
Statistical analysis
Statistical analysis was performed using R 3.4.4 (https://www.r-project.org/). NA value imputation was performed using the R package DMwR (0.4.1) [32]. The Kolmogorov-Smirnov test was applied to determine whether continuous variables followed a Gaussian distribution. Student's t test and the Mann-Whitney U test were used to compare Gaussian and non-Gaussian distributed continuous variables, respectively. The chi-square test and Fisher's exact test were used for categorical variables. The DeLong test was used to compare the performance of two ROC curves. A probability value (p-value) of < 0.05 was considered statistically significant.