A comprehensive predictive method for low fetal fraction in noninvasive prenatal screening


This study aimed to investigate factors associated with lower fetal fraction (FF) in noninvasive prenatal screening (NIPS) and to develop a new method for predicting low FF before NIPS. It was a retrospective cohort analysis based on the results of NIPS, complete blood count, thyroxine testing and Down's syndrome screening in the first or second trimester from 14043 pregnant women. A random forest algorithm was applied to predict low fetal fraction of cell-free DNA (FF below the 10th percentile) from individual and laboratory information. Model performance was evaluated and compared with prediction using maternal weight alone.

Results
Among the 14043 cases, maternal weight, RBC, HGB and free T3 were significantly negatively correlated with FF, while gestational age, free T4, PAPP-A, AFP, uE3 and β-hCG were significantly positively correlated with FF. Compared with prediction using maternal weight as an isolated parameter, the model had a higher area under the curve (AUC) of the receiver operating characteristic (ROC) and a higher overall accuracy.

Conclusions
The comprehensive predictive method based on multiple combined factors was more effective than a single-factor model in predicting low FF status. This method can provide more information for clinical decision-making and for pre-test quality control of NIPS.

Background
Noninvasive prenatal screening (NIPS) is a method of screening for fetal chromosomal aneuploidy that is mainly based on next-generation sequencing (NGS). NIPS has been validated in multiple clinical cohorts as highly sensitive and specific for patients at increased risk of T13, T18 and T21 aneuploidies [1][2][3]. It also has potential value in prenatal screening for copy number variants (CNV) and single nucleotide variants (SNV) through targeted sequencing and increased sequencing depth [4,5]. NIPS is now widely used as a first-tier prenatal screening method for pregnancies at high risk of aneuploidy [6].
Cell-free DNA in maternal plasma, which is derived from both mother and fetus, is the sequencing target of NIPS. While z-scores are used for evaluating the risk of aneuploidy of chromosomes 13, 18 and 21, the fetal fraction (FF) of cell-free DNA is also considered critical to the results [7,8]. FF is derived from the NIPS sequencing results and is calculated using the chrY concentration or generalized linear regression algorithms such as SeqFF [9]. It has been confirmed that an FF lower than 3.5%-4% is a key factor that may cause no-call reports or false positive and false negative results [10,11]. The American College of Medical Genetics and Genomics (ACMG) recommended that a clearly visible fetal fraction be included on NIPS reports [11]. It is therefore important to determine whether FF is too low in order to ensure the quality of NIPS results.
However, it remains a challenge to predict FF before NIPS and to provide the predicted FF as reference information for avoiding no-call or unreliable reports. Several studies suggested that individual characteristics such as maternal weight and gestational age influence FF [12,13]. Significant obesity and gestational age below 9-12 weeks are linked to low FF and a high probability of unreliable results, and NIPS is consequently not recommended in these situations. FF is also speculated to be associated with placental development and volume, which partially relate to chromosomal aneuploidy and to Down's serum screening in the first and second trimester [14]. Nevertheless, because knowledge of the mechanisms by which cell-free DNA is generated and degraded is insufficient, the relationships between FF and these variables are not strong enough for any isolated factor to accurately predict whether FF is too low [15].
The aim of this study was to investigate the factors associated with lower fetal fraction and to better predict low FF status. The study was based on data collected in our center, comprising NIPS information and the results of complete blood count, thyroxine testing and Down's serum screening in the first and second trimester. The random forest algorithm, an ensemble machine learning algorithm commonly used in various prediction scenarios and appropriate for data with high dimensionality and collinearity, was applied for a comprehensive analysis of clinical and individual profiles [16]. Prediction performance was evaluated and compared with single-factor predictive models using maternal weight, in order to provide a more effective reference for deciding on the clinical application of NIPS.

Profiles of study population
In this dataset, the 10th percentile of FF was 5.1%. The characteristics and profiles of the 14043 NIPS participants with normal (FF > 5.1%) or low (FF ≤ 5.1%) fetal fraction status were described in

Associations Between Laboratory Measurements And Fetal Fraction
To reveal the relationships between laboratory measurements and fetal fraction, Pearson correlation coefficients were calculated against fetal fraction values. In addition, the AUC of the ROC curve and adjusted odds ratios (OR) were obtained for low versus normal fetal fraction status (Table 2). The difference in each measurement between the low and normal fetal fraction status groups is shown in Fig. 1A. None of the serum markers showed a strong linear association with fetal fraction (r < 0.2). However, some variables were predictive of and suggestive for fetal fraction status. In the low fetal fraction status group, RBC, HGB and free T3 were significantly higher, while free T4, PAPP-A, AFP, uE3 and β-hCG in the first and second trimester were significantly lower. AUCs and adjusted ORs showed that free T3, RBC and HGB were relative risk factors for low FF status, and that PAPP-A, AFP, uE3 and β-hCG in the first and second trimester were protective factors (Fig. 2A). Notably, higher maternal weight in the first and second trimester was significantly associated with low FF status. In contrast, no significant association of TSH or maternal age with FF status was observed. Gestational age was weakly positively related to FF (r = 0.2184), but was not clearly predictive of FF status (AUC = 0.5204) in our dataset.

NA Value Imputation
The k-nearest neighbors (kNN) algorithm was applied for Not Available (NA) value imputation to make the dataset complete. The median of each variable before and after NA value imputation is shown in Table 2. For each variable, the Mann-Whitney U test showed no significant difference before and after imputation (p > 0.05). Density plots showed that, except for RBC and HGB, the distribution density curve after imputation was close to the curve before imputation, indicating that NA value imputation had no significant impact on the characteristics of the data distribution (Fig. 1B). The change in distribution for RBC and HGB might be caused by their high proportion of NA values.
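As an illustration of the idea behind kNN imputation, the following is a minimal sketch in plain Python. It is not the implementation used in the study: the value of k, the distance definition (root mean squared difference over the variables observed in both samples) and the example values are all assumptions made for illustration.

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance between two rows is the root mean squared difference over the
    columns observed in both rows (an assumed choice, for illustration only).
    """
    n_cols = len(rows[0])
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbor must be a different row that observes column j.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [value for _, value in candidates[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Hypothetical z-scored measurements; the last sample has one missing value.
samples = [
    [0.0, 0.0, 1.0],
    [0.1, 0.1, 1.2],
    [5.0, 5.0, 9.0],
    [0.05, 0.05, None],
]
filled = knn_impute(samples, k=2)
print(filled[3])
```

In this toy example the missing value is filled from the two samples closest to the incomplete one, so the distant third sample does not contribute.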

Performance Evaluation Of Predictive Model
According to the correlation analysis above, TSH and maternal age were excluded from model training. Gestational age, maternal weight in the first and second trimester, RBC, HGB, free T3, free T4, PAPP-A, AFP, uE3 and β-hCG in the first and second trimester were input into the predictive model as independent variables. After 10 iterations of model training and validation, the average ROC and PR curves for predicted FF status are shown in Fig. 2B. The AUC of the random forest model (0.7022) was significantly higher (p < 0.001) than that of maternal weight in the first trimester (0.6595) and second trimester (0.6508), indicating that the model was more effective than maternal weight in predicting low FF status. The PR curve also showed that the random forest model performed better than maternal weight. FF was significantly lower in the predicted low FF status group than in the predicted normal FF status group (Fig. 2C). The overall accuracy of the predictive model was 0.8532, higher than the overall accuracy of prediction by maternal weight in the first (0.8083) and second trimester (0.8076).

Discussion
The amount of fetal DNA in a maternal plasma sample is affected by multiple factors, which makes it challenging to predict a low fetal DNA fraction before NIPS. In this study, to predict the existence of low FF status, we collected 14 laboratory test results and items of individual information, and fitted a regression model based on the random forests algorithm. In our cohort, we confirmed that the model was more effective in predicting low FF status than maternal weight alone and was robust against incomplete observations.
Fetal fraction is a variable and complex biological indicator that is influenced by individual differences and laboratory factors. It has been reported that overweight pregnant women are at higher risk of test failure due to low fetal fraction. This might be due to a dilutional effect and a higher release of maternal cfDNA from adipose cells into the systemic circulation [17]. Gestational age is another factor relevant to fetal fraction. Wang et al. suggested that the fetal fraction of cfDNA rises by almost 1% per week after 20 weeks of gestation, but by only 0.1% per week between 10 and 21 weeks of gestation [13]. It has been suggested that low placental volume could also lead to a low fetal fraction. Ashoor et al. reported that fetal fraction increases with maternal serum levels of free β-hCG and PAPP-A, which reflect placental volume and development [18,19]. In addition, the number of fetuses, fetal aneuploidies and maternal smoking are reported to be associated with lower fetal fraction [20,21]. It has also been found that maternal exercise leads to a decrease in fetal fraction, as the level of maternal cfDNA increases directly after physical activity [22]. It should be noted that fetal fraction could also be affected by sample transport or laboratory workflow [23]. In our cohort, gestational age, maternal weight in the first and second trimester, and serum markers of Down's syndrome screening such as PAPP-A, AFP, uE3 and β-hCG were all weakly correlated with FF, consistent with previous studies. In addition, this is a novel report of correlations of RBC, HGB and free T3 (negative) and of free T4 (positive) with fetal fraction.
Although it is difficult to reliably predict the value of fetal fraction without a quantitative experiment, we suggest that qualitative prediction of low FF status, binarized from fetal fraction, is practicable. Currently, isolated information with a constant cutoff value, such as maternal weight and gestational age, is used to evaluate low FF status and to guard against unreliable reports. In this study, we confirmed that prediction based on combined clinical and laboratory indicators using a machine learning algorithm is more effective. For samples predicted to have low FF status, sequencing shorter cfDNA fragments or cfDNA enrichment can be used to improve the FF of NIPS [24]. Furthermore, the raw clinical dataset contained a large number of missing values, and directly filtering out records with missing values would have left insufficient specimens for subsequent analysis. Incomplete data are common in clinical practice and cannot be avoided. Therefore, we adopted missing value imputation to reduce the impact of missing values and increase the robustness of the model.

Conclusions
In conclusion, we reported a new method based on comprehensive clinical and laboratory information for predicting relatively low fetal fraction status in NIPS. We confirmed that the prediction model was more effective than prediction using maternal weight as the sole independent variable. This study is an application of machine learning in prenatal screening and may provide more reference information for clinical decisions about NIPS.

NIPS And Fetal Fraction Calculation
For each participant, a 5 ml maternal whole blood sample was collected using EDTA-K2 tubes (BD, UK). Cell-free maternal plasma was separated and purified by sequential centrifugation, first of whole blood at 1600 g (10 min, 4 °C) and then of plasma at 13000 g (10 min, 4 °C). DNA extraction and library construction were performed for each sample following the instructions of the NIFTY chromosomal abnormality test kit (BGI, Wuhan, China) [10]. Library quantification was performed using Qubit 3.0 (Thermo, USA). Forty-eight libraries were pooled into one mixed library, which was single-end sequenced on the BGISEQ-500 sequencing platform (BGI, Wuhan, China) with a read length of 35 bp and an average of 9.7M reads per sample.
After read alignment to the reference genome hg19 (bwa-0.7.11) [25], PCR duplicate removal (SAMtools-1.9) [26], read counting in 30 kb bins (bedtools-2.18.0) [27] and GC correction based on LOWESS regression, fetal fraction was calculated using the chromosome Y (chrY) method and SeqFF [28,29]. First, the FF of all samples with a male fetus was obtained by the chrY method. Then, a SeqFF model comprising an elastic net regression and the weighted rank selection criterion (WRSC) was fitted using the male samples. Finally, the FF of both male and female samples was calculated using the SeqFF model and used as the final outcome for the subsequent prediction. A standalone chrY fraction reference and SeqFF model were established in our lab to fit the BGI sequencing platform. In this study, relative low FF status was defined as an FF lower than the 10th percentile of FF across all samples. Because relative low FF status is a logical variable, its presence was coded as 1 and its absence as 0.
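The binarization of FF into relative low FF status can be sketched as follows. This is an illustrative sketch rather than the study's code: the linear-interpolation percentile, the helper names and the FF values are assumptions, and samples exactly at the cutoff are counted as low, following the "low (FF ≤ 5.1%)" grouping given in the study population profile.

```python
import math

def percentile(values, pct):
    """Percentile by linear interpolation between closest ranks (assumed method)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def low_ff_status(ff_values, pct=10):
    """Return the cutoff and a 0/1 label per sample (1 = relative low FF status)."""
    cutoff = percentile(ff_values, pct)
    labels = [1 if ff <= cutoff else 0 for ff in ff_values]
    return cutoff, labels

# Hypothetical fetal fractions in percent, not taken from the study data.
ff = [12.4, 5.1, 9.8, 3.9, 7.2, 15.0, 10.5, 8.3, 6.6, 11.1, 13.7]
cutoff, status = low_ff_status(ff)
print(cutoff, status)
```

Because the cutoff is a percentile of the cohort itself, roughly one sample in ten is labelled low by construction, so the response variable is strongly imbalanced.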

Association Between FF And Clinical Variables
To construct a machine learning model predicting the existence of low fetal fraction, we collected the clinical and individual information listed in Table 1 for each participant. NA values were accepted to enhance the robustness of the model. Maternal weight in the first or second trimester was included separately instead of BMI because height was not available in this dataset.
For each variable, the Pearson correlation coefficient with FF and the AUC of the ROC curve for predicting low FF status were calculated to explore its impact on FF. To investigate risk factors for low FF status, the OR and 95% confidence interval (CI) for each variable were calculated, adjusted for maternal age and gestational age. In addition, all specimens were divided into a higher group and a lower group for each variable, and the Mann-Whitney U test was performed to assess the significance of the difference in FF between the two groups.
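The two per-variable summaries described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's analysis code: the function names and all values are hypothetical, and the AUC is computed through its pairwise (Mann-Whitney) interpretation, which is quadratic in the number of samples and therefore suitable only for small examples.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def auc(scores, labels):
    """AUC as the probability that a positive sample scores above a negative one.

    Ties count as one half. Labels are 1 (low FF status) or 0 (normal).
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical maternal weights (kg), fetal fractions (%) and low FF labels.
weight = [52, 58, 61, 66, 70, 75, 83, 90]
ff = [14.0, 12.5, 11.0, 10.0, 5.0, 7.5, 4.8, 4.0]
low = [0, 0, 0, 0, 1, 0, 1, 1]

r = pearson_r(weight, ff)
weight_auc = auc(weight, low)
print(round(r, 3), round(weight_auc, 3))
```

For a variable that is negatively associated with low FF status, such as a protective serum marker, the score would be negated before computing the AUC so that higher scores still indicate higher risk.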

Random Forest Model And K-fold Cross Validation
Data preprocessing was performed before fitting the prediction model. First, clinical laboratory test results and information were normalized using the z-score method after outliers were replaced with NA values. Second, NA values were imputed using the kNN method. Finally, the dataset was split randomly and equally into 10 folds. For each iteration, 7 folds were selected as the training set, containing 70% of the specimens, and the remaining 3 folds as the test set, containing 30% of the specimens [30].
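The fold-based split described above can be sketched as follows. The text does not state how the 7 training folds were chosen in each iteration, so the random selection with an iteration-specific seed shown here is an assumption, as are the function names and the sample count.

```python
import random

def make_folds(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and deal them into n_folds nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def train_test_indices(folds, n_train_folds=7, seed=0):
    """Pick n_train_folds folds for training; the remaining folds form the test set."""
    order = list(range(len(folds)))
    random.Random(seed).shuffle(order)
    train = [i for f in order[:n_train_folds] for i in folds[f]]
    test = [i for f in order[n_train_folds:] for i in folds[f]]
    return train, test

# Hypothetical cohort of 100 samples, with 10 train/validate iterations.
folds = make_folds(100)
splits = [train_test_indices(folds, seed=it) for it in range(10)]
train, test = splits[0]
print(len(train), len(test))
```

Unlike standard k-fold cross-validation, where each fold serves as the test set exactly once, this scheme reuses the same 10 folds to draw a fresh 70/30 partition in every iteration.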
We established a supervised regression model using the R package randomForest 4.6-14, which implements the random forests algorithm introduced by Breiman et al. [31]. All clinical test results and information in the training set were input as independent variables, with relative low FF status as the response variable. The fitted model was then validated on the test set by predicting relative low FF status, with outputs ranging in [0,1]. Samples with a predicted value lower than the 10th percentile were marked as predicted relative low FF status.
Overall accuracy, ROC curves and precision-recall (PR) curves were selected as the performance evaluation metrics. To compare this comprehensive predictive method with a single-variable predictive method, we also defined a prediction method based on maternal weight in the first and second trimester: samples with maternal weight higher than the 90th percentile were marked as predicted relative low FF status. We then compared the effectiveness of the predictive model in classifying low and normal FF status against that of maternal weight in the first and second trimester.
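The weight-based comparison rule and the accuracy calculation can be illustrated with a short sketch. The percentile interpolation method, the helper names and all values below are assumptions made for illustration and are not taken from the study.

```python
import math

def percentile(values, pct):
    """Percentile by linear interpolation between closest ranks (assumed method)."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def weight_baseline(weights, pct=90):
    """Mark samples heavier than the pct-th percentile as predicted low FF (1)."""
    cutoff = percentile(weights, pct)
    return [1 if w > cutoff else 0 for w in weights]

def evaluate(pred, truth):
    """Overall accuracy, precision and recall for 0/1 predictions."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return {
        "accuracy": (tp + tn) / len(truth),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical maternal weights (kg) and observed low FF labels.
weights = [50, 54, 57, 60, 62, 65, 68, 72, 78, 85, 96]
truth = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

pred = weight_baseline(weights)
metrics = evaluate(pred, truth)
print(metrics)
```

In this toy example the rule misses one of the two low FF samples yet still reaches an accuracy above 0.9. When only about one sample in ten is positive, overall accuracy is dominated by the majority class, which is one reason to read it alongside the ROC and PR curves.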

Ethics approval and consent to participate
The study was approved by the ethics committee of Shenzhen Longgang District Women and Children Healthcare Hospital before implementation.

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Funding
The presented work was supported by grants of Shenzhen Science and Technology Innovation Commission (Grants numbers: JCYJ20160427114320284 and JCYJ20180305125647151) and grants of Shenzhen Longgang Science and Technology Innovation Commission (Grants numbers: LGKCZSYS2018000010