Data Collection
In this study, we integrated three datasets obtained from the Gene Expression Omnibus (GEO). These datasets contained transcriptomic data from several breast cancer cell lines, namely MDA-MB-231, MCF7, HS578T, and BAS (a novel line isolated from a metaplastic breast cancer tumor), along with their drug-resistant derivatives. Collectively, the datasets comprised 28 samples profiling the gene expression associated with drug resistance in breast cancer cells. The datasets were generated on different microarray platforms: GPL96, GPL16686, and GPL23159. By integrating and analyzing these datasets, we aimed to identify potential therapeutic targets and improve our understanding of drug resistance mechanisms in breast cancer [16].
Fig. 1 visualizes the overlapping relationships between the gene sets of four breast cancer cell lines: BAS, HS578T, MCF7, and MDA-MB-231. Comparing the gene counts and the unique genes of these cell lines provides insights into the similarities and differences in their gene expression profiles.
A total of 130,278 genes were identified across all four cell lines, with 27,207 genes found to be common (repeated) among them. The MCF7 cell line had the highest number of genes (53,617), while MDA-MB-231 had the lowest count (22,283). The BAS and HS578T cell lines shared an equal count of 27,189 genes.
Interestingly, when comparing the gene sets of BAS, HS578T, and MCF7 cell lines, 27,207 repeated genes were found, indicating a significant overlap between the three cell lines. Similarly, overlapping genes were observed in various combinations of cell line pairs, emphasizing their shared molecular characteristics.
However, unique gene sets were also identified for each cell line, indicating distinctions in their gene expression profiles and potentially contributing to their distinct biological behaviors and therapeutic responses. Overall, this Venn diagram analysis provides a valuable framework for understanding the genetic heterogeneity and similarities within these breast cancer cell lines, which could aid in developing targeted therapeutic strategies for different breast cancer subtypes.
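The regions of a Venn diagram like the one described above can be computed with plain set operations. The following is a minimal sketch, assuming each cell line is represented by a set of gene identifiers; the function name `venn_regions` and the toy identifiers `g1` to `g6` are hypothetical and are not taken from the study's data.

```python
from itertools import combinations

def venn_regions(gene_sets):
    """Map each combination of cell lines to the genes found in exactly those lines."""
    names = sorted(gene_sets)
    regions = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            # Genes present in every line of this combination ...
            inside = set.intersection(*(gene_sets[n] for n in combo))
            # ... and absent from every other line.
            others = [gene_sets[n] for n in names if n not in combo]
            outside = set().union(*others)
            regions[combo] = inside - outside
    return regions

# Hypothetical toy identifiers, not the study's data.
toy_sets = {
    "BAS": {"g1", "g2", "g3"},
    "HS578T": {"g1", "g2", "g4"},
    "MCF7": {"g1", "g5"},
    "MDA-MB-231": {"g1", "g6"},
}
regions = venn_regions(toy_sets)
shared_by_all = regions[("BAS", "HS578T", "MCF7", "MDA-MB-231")]
```

Because every gene falls into exactly one region, the region sizes sum to the number of distinct genes, which gives a quick consistency check on reported overlap counts.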
Table 1 summarizes the gene counts and their respective percentages for the four breast cancer cell lines: BAS, HS578T, MCF7, and MDA-MB-231. It provides an overview of the distribution of gene counts across these cell lines, highlighting their individual contributions to the total gene count and the uniqueness of their gene sets. The BAS cell line has a gene count of 27,189, which accounts for 20.87% of the total gene count. Notably, all of these genes are unique to this cell line, emphasizing its distinct molecular characteristics compared to the other cell lines. Similar to BAS, the HS578T cell line has a gene count of 27,189, contributing 20.87% of the total gene count. Likewise, all genes are unique to HS578T, indicating that it has a unique gene expression profile compared to the other cell lines. With a gene count of 53,617, the MCF7 cell line represents 41.16% of the total gene count. Interestingly, all of these genes are unique to MCF7, highlighting the significant differences in gene expression between MCF7 and the other cell lines. The MDA-MB-231 cell line has a gene count of 22,283, constituting 17.10% of the total gene count. Again, all genes are unique to MDA-MB-231, underlining its unique molecular properties among the four cell lines.
Table 2
Count and percentage of genes per cell line, and number of unique genes in each cell line

| | BAS | HS578T | MCF7 | MDA-MB-231 | SUM |
|---|---|---|---|---|---|
| Count | 27189 | 27189 | 53617 | 22283 | 130278 |
| Percent | 20.87% | 20.87% | 41.16% | 17.10% | 100% |
| Unique single | 27189 | 27189 | 53617 | 22283 | 130278 |
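The percentages in the table above follow directly from the per-line counts. A short sketch of the arithmetic, using only the counts reported in the table:

```python
# Gene counts per cell line, as reported in the table.
counts = {"BAS": 27189, "HS578T": 27189, "MCF7": 53617, "MDA-MB-231": 22283}

total = sum(counts.values())

# Share of the total gene count contributed by each cell line, in percent.
percent = {name: round(100 * n / total, 2) for name, n in counts.items()}

for name, value in percent.items():
    print(f"{name}: {value:.2f}%")
```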
Gene Expression Analysis
In this section, we performed a comprehensive gene expression analysis using the integrated transcriptomic data obtained from various breast cancer cell lines (MDA-MB-231, MCF7, HS578T, and BAS) and their drug-resistant derivatives. The aim was to identify key genes and molecular pathways associated with drug resistance in breast cancer cells, potentially leading to the discovery of novel therapeutic targets.
Initially, we examined the gene counts and unique gene sets for each cell line. The MCF7 cell line had the highest number of genes (53,617), while MDA-MB-231 had the lowest count (22,283). Interestingly, all genes in each cell line were unique, highlighting the genetic heterogeneity among the four breast cancer cell lines and suggesting distinct molecular characteristics and biological behaviors.
To further investigate the similarities and differences in gene expression profiles, we analyzed the overlapping relationships between the gene sets of these cell lines using a Venn diagram. We found that 27,207 genes were common (repeated) among all four cell lines, indicating a shared molecular basis. However, the unique gene sets for each cell line suggested specific gene expression patterns and pathways contributing to their individual drug resistance mechanisms.
Our gene expression analysis provides valuable insights into the diverse molecular landscape of drug-resistant breast cancer cell lines. By identifying key genes and pathways associated with drug resistance, this study lays the foundation for the development of targeted therapeutic strategies and potential biomarkers for overcoming drug resistance in breast cancer treatment. Further functional studies on the identified genes and pathways will help improve our understanding of the underlying mechanisms and contribute to more effective treatment options for breast cancer patients.
Software and package information
R and packages: R: 4.0.3, affy: 1.68.0, Biobase: 2.50.0, frma: 1.42.0, hgu133plus2frmavecs: 1.5.0, ggbiplot: 0.55, genefilter: 1.72.1, ggplot: 3.3.4, preprocessCore: 1.52.0, sva: 3.38.0, impute: 1.64.0, WGCNA: 1.70-3, fastcluster: 1.2.3, dynamicTreeCut: 1.63-1, limma: 3.44.3, biomart: 2.44.4, dplyr: 1.0.6, plotly: 4.9.4, tidyverse: 1.3.1, gridExtra: 2.3.
Python and modules: Python: 3.8.5, numpy: 1.19.2, pandas: 1.1.3, seaborn: 0.11.0, sklearn: 0.24.1, matplotlib: 3.3.2, conda: 4.10.3.
The statistical programming language R was used for data processing and analysis. R was likely chosen for its extensive capabilities in handling and analyzing large datasets and for its specialized packages for bioinformatics and genomics research.
Data Output and Preparation for Machine Learning
After analyzing the gene expression data in R, we generated a CSV file containing the following information for each gene:
ID Column
This likely refers to gene identifiers, such as gene symbols or accession numbers.
Adjusted P-value
The adjusted p-value corrects for multiple hypothesis testing to minimize false-positive results. This helps to identify statistically significant differences in gene expression between control and paclitaxel-resistant cells.
P-value
The p-value indicates the statistical significance of differences in gene expression between control and paclitaxel-resistant cells.
T
The t-statistic comparing mean expression between the two groups (control and paclitaxel-resistant cells) to determine whether there is a significant difference in gene expression. In limma output, which this set of columns matches, it is the moderated t-statistic.
B
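The B-statistic, which in limma output is the log-odds that a gene is differentially expressed between the two groups. It is not the result of a separate test; higher values indicate stronger evidence of differential expression.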
Log Fold Change (logFC)
The logarithm of the fold change in gene expression between control and paclitaxel-resistant cells (base 2 in limma output). This measure quantifies the magnitude and direction of the change in gene expression.
Cell Line Label
This indicates the cell line (BAS, HS578T, MCF7, or MDA-MB-231) associated with each gene expression data point.
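The relationship between the raw and adjusted p-value columns described above can be illustrated with a multiple-testing correction. The source does not state which correction was applied; the sketch below assumes the Benjamini-Hochberg procedure, which is a common default, and the function name and input values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
```

Each adjusted value is at least as large as its raw p-value, which is why fewer genes pass a fixed significance threshold after correction.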
The resulting CSV file containing this information was used in subsequent machine learning stages, such as training and evaluating classifiers to predict paclitaxel resistance based on gene expression patterns.
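As an illustration of how such a file might be read and turned into a numeric feature table, the sketch below uses Python's standard `csv` module on an in-memory excerpt. The column names (`adj.P.Val`, `P.Value`, `t`, `B`, `logFC`, `cell_line`), the gene identifiers, the values, and the filtering thresholds are all assumptions made for the example; the actual file layout is not given in the text.

```python
import csv
import io

# Hypothetical excerpt; column names and values are illustrative assumptions.
CSV_TEXT = """ID,adj.P.Val,P.Value,t,B,logFC,cell_line
GENE_A,0.001,0.0001,8.2,4.1,2.5,MCF7
GENE_B,0.300,0.1500,1.1,-4.0,0.2,MCF7
GENE_C,0.020,0.0040,-5.3,1.2,-1.8,BAS
"""

NUMERIC = ("adj.P.Val", "P.Value", "t", "B", "logFC")

def load_rows(handle):
    """Read the CSV and convert the statistic columns to floats."""
    rows = []
    for record in csv.DictReader(handle):
        for col in NUMERIC:
            record[col] = float(record[col])
        rows.append(record)
    return rows

def select_significant(rows, alpha=0.05, min_abs_logfc=1.0):
    """Keep genes passing assumed significance and effect-size cutoffs."""
    return [
        r for r in rows
        if r["adj.P.Val"] < alpha and abs(r["logFC"]) >= min_abs_logfc
    ]

rows = load_rows(io.StringIO(CSV_TEXT))
significant = select_significant(rows)

# Numeric feature matrix (one row per gene) for a downstream classifier.
features = [[r[c] for c in NUMERIC] for r in rows]
```

In practice the same steps could be done with pandas, which is listed among the modules used; the standard library is used here only to keep the sketch dependency-free.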
Machine Learning Techniques
This study employed various machine learning classifiers to classify paclitaxel-resistant cell lines based on gene expression analysis [5]. The selected classifiers were the Random Forest Classifier, Support Vector Machine (SVM), Gaussian Naive Bayes, K-Nearest Neighbors (KNN) Classifier, Decision Tree Classifier, and AdaBoost Classifier. Each classifier was chosen for specific reasons and offered unique advantages for this particular classification task.
The Random Forest Classifier was selected due to its ability to handle high-dimensional data and capture complex interactions between features [17]. It constructs multiple decision trees and combines their predictions to make accurate classifications. The ensemble nature of the Random Forest Classifier helps to reduce overfitting and enhance generalization performance.
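The ensemble step described above (many trees trained on resampled data, combined by voting) can be sketched in a few lines. To stay short, this example replaces full decision trees with one-split "stumps", so it illustrates bootstrap aggregation with random feature choice rather than a complete random forest; all names and the toy data are hypothetical.

```python
import random
from collections import Counter

def train_stump(X, y, feature):
    """One-split 'tree': threshold halfway between the two class means."""
    mean0 = sum(x[feature] for x, c in zip(X, y) if c == 0) / y.count(0)
    mean1 = sum(x[feature] for x, c in zip(X, y) if c == 1) / y.count(1)
    threshold = (mean0 + mean1) / 2
    high_class = 1 if mean1 > mean0 else 0
    return feature, threshold, high_class

def stump_predict(stump, x):
    feature, threshold, high_class = stump
    return high_class if x[feature] > threshold else 1 - high_class

def train_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, forest = len(X), []
    while len(forest) < n_trees:
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        ys = [y[i] for i in idx]
        if len(set(ys)) < 2:                         # resample if a class is missing
            continue
        feature = rng.randrange(len(X[0]))           # random feature per learner
        forest.append(train_stump([X[i] for i in idx], ys, feature))
    return forest

def forest_predict(forest, x):
    """Majority vote over all learners."""
    votes = Counter(stump_predict(s, x) for s in forest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature toy data: class 0 = control, class 1 = resistant.
X = [[1.0, 1.5], [1.5, 1.0], [2.0, 2.0], [1.2, 1.8],
     [8.0, 8.5], [8.5, 9.0], [9.0, 8.0], [8.2, 8.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(X, y)
```

Averaging over learners trained on different resamples is what reduces the variance of any single tree.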
The Support Vector Machine (SVM) was chosen for its effectiveness in dealing with both linearly separable and non-linearly separable data [18]. SVMs use hyperplanes to separate data points and create decision boundaries. They can handle high-dimensional data and are less prone to overfitting. SVMs have been successfully applied in various biological and medical classification tasks, making them a suitable choice for this study [19].
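For the linear case, the separating hyperplane and the quantity an SVM minimizes can be written down directly. The sketch below evaluates the decision function and the soft-margin objective for a given weight vector; it does not train a model or use kernels, and the weights, bias, and data points are hypothetical values chosen for the example.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of hinge losses max(0, 1 - y * f(x))."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack = sum(
        max(0.0, 1 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y)
    )
    return margin_term + C * slack

# Hypothetical hyperplane and labeled points (labels are +1 / -1).
w, b = [1.0, 0.0], -1.0
X = [[2.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
y = [1, -1, 1]

objective = soft_margin_objective(w, b, X, y)
```

The third point lies on the hyperplane, so it contributes a hinge loss of 1; the other two sit exactly on the margin and contribute nothing.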
Gaussian Naive Bayes (GNB) was included as a probabilistic classifier. GNB assumes feature independence and uses Bayes’ theorem to compute the probability of a sample belonging to a particular class [20]. GNB is computationally efficient, especially for large datasets, and performs well in cases where feature independence holds to a reasonable degree. However, it may not capture complex interactions between features as effectively as other classifiers.
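The independence assumption means the class score is simply the log prior plus a sum of per-feature Gaussian log densities. A minimal sketch of that computation, with hypothetical function names and toy data:

```python
import math

def fit_gnb(X, y):
    """Per class: prior, per-feature mean, per-feature variance."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, c in zip(X, y) if c == label]
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # small floor avoids zero variance
            for col, m in zip(cols, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def log_posterior(model, x):
    """Unnormalized log posterior of each class for sample x."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict_gnb(model, x):
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)

# Hypothetical two-feature toy data.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y = ["control"] * 3 + ["resistant"] * 3
model = fit_gnb(X, y)
```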
The K-Nearest Neighbors (KNN) Classifier was chosen for its simplicity and effectiveness in dealing with multi-class classification problems [21]. KNN assigns labels based on the labels of the nearest neighbors in the feature space. It is a non-parametric algorithm and does not make strong assumptions about the underlying data distribution. KNN is particularly useful when the decision boundaries are nonlinear and the number of classes is small.
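The neighbor-voting rule described above fits in a few lines. This sketch assumes Euclidean distance and k = 3, neither of which is specified in the text, and uses hypothetical toy data with three classes.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(X_train, y_train)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical toy data with three well-separated classes.
X_train = [[0, 0], [0, 1], [1, 0],
           [5, 5], [5, 6], [6, 5],
           [10, 0], [10, 1], [9, 0]]
y_train = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

label = knn_predict(X_train, y_train, [5.5, 5.2])
```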
The Decision Tree Classifier is a straightforward and interpretable classifier that creates a tree-like model based on feature values [22]. It is capable of handling both categorical and numerical data and provides insights into feature importance. Decision trees are useful for identifying relevant genes and understanding the decision-making process in the classification task [23].
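A tree is grown by repeatedly choosing the feature threshold that makes the resulting groups as pure as possible. The sketch below shows that single step for one feature, assuming Gini impurity as the purity measure (the text does not name the criterion); the feature values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one feature."""
    best = (None, gini(labels))
    candidates = sorted(set(values))
    # Try the midpoint between each pair of adjacent distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [c for v, c in zip(values, labels) if v <= threshold]
        right = [c for v, c in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical values of one feature, with labels 0 = control, 1 = resistant.
values = [0.5, 1.0, 1.5, 3.5, 4.0, 4.5]
labels = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(values, labels)
```

The reduction in impurity achieved by a feature's splits is also the basis for the feature-importance scores mentioned above.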
Lastly, the AdaBoost Classifier was selected as an ensemble method that combines multiple weak classifiers to create a strong classifier [24]. It trains weak models sequentially on reweighted versions of the training data, placing more emphasis in each iteration on the samples that were previously misclassified. AdaBoost is known for its ability to improve classification performance, especially when combined with simple base classifiers.
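The emphasis on misclassified samples comes from a per-round weight update. The sketch below shows one round of the classic discrete AdaBoost update for two classes, assuming the weak learner's weighted error lies strictly between 0 and 1; the function name and the example predictions are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One boosting round: weighted error, learner weight, renormalized sample weights."""
    # Weighted error of the weak learner (weights are assumed to sum to 1).
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight: larger when the weak learner is more accurate.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink weights of correct samples, grow weights of misclassified ones.
    updated = [
        w * math.exp(-alpha if t == p else alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return error, alpha, [w / total for w in updated]

# Five samples with equal weight; the weak learner gets the last one wrong.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]

error, alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.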
The selected classifiers for this work offer several benefits, such as efficiently processing high-dimensional gene expression data and identifying intricate gene interactions [25]. These classifiers excel in dealing with nonlinear decision boundaries and provide valuable information on feature importance, making them advantageous for gaining deeper insights into gene expression patterns [26, 27]. Additionally, their extensive use in bioinformatics and biomedical research highlights their appropriateness for this classification task [28].
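For orientation, the six classifiers can be instantiated and compared side by side with scikit-learn, which is listed among the modules used. This is a sketch only: the synthetic data from `make_classification` stands in for the real feature table, and the default hyperparameters, 3-fold stratified cross-validation, and accuracy metric are assumptions rather than the study's reported settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the per-gene feature table (not the study's data).
X, y = make_classification(
    n_samples=120, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Default hyperparameters are an assumption; the text does not report them.
classifiers = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
mean_accuracy = {
    name: cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
    for name, clf in classifiers.items()
}

for name, acc in mean_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Using the same cross-validation splits for every model keeps the comparison fair, since each classifier is scored on identical held-out samples.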