Cybersecurity enhancement to detect credit card frauds in health care using new machine learning strategies

As the usage of credit cards has become more common in health care applications of everyday life, banks have found it very difficult to detect credit card fraud (CCF) systematically. The fraudulent activities should be identified and detected using new techniques. As a result, machine learning (ML) can help detect CCF in transactions while reducing the strain on financial institutions. This research aims to improve cybersecurity measures by detecting fraudulent transactions in datasets. The new classifier strategies cluster and classifier-based decision tree (CCDT), cluster and classifier-based logistic regression (CCLR), and cluster and classifier-based random forest (CCRF) are modeled in this research. The proposed strategies are applied to detect fraudulent health care activities. This research performed the preprocessing through the feature extraction, sampling, and transformation stages, and the proposed classifiers are simulated, and the results are analyzed. The significant results expected range of the proposed classifiers over the other methods are accuracy—(99.95%, 99.97%), precision—(99.96%, 99.98%), sensitivity—(99.9%, 100%), specificity—(99.8%, 100%). The parameters μ\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\mu $$\end{document}, location, the binary variable, cluster size, and decision tree sampling observations affect the classifiers' performance. CCRF and CCLR obtain the expected significant results than other existing methods.


Introduction
The use of credit cards has increased as the world moves toward digitization and money transactions become paperless. Credit card utilization has started to increase among all customers due to the necessary and urgent requirements. When making an online purchase, many consumers prefer to use credit cards (Tran and Dang 2021). Credit cards assist us in making purchases even if we do not have the necessary cash. Unfortunately, it appears that fraudsters are keeping track of these aspects and are even succeeding in exploiting them in this evolving environment. Today, fraudsters can be creative, intelligent, and fast, so fraud activities involving credit cards have also been on the rise, resulting in significant losses for individuals and financial institutions (Li et al. 2021). Credit card fraud (CCF) occurs when someone uses another person's credit card or account details to make illegal purchases or use the fund. Most online fraud transactions were made remotely only using credit card data. In most cases, the credit cardholder is unaware that their card information has been stolen and used by someone else.
Since online transactions increase every month, there is a significant increase in fraudulent operations. CCF is one of the most problematic, so we must design new strategies to detect it. Many fraud detection methods and soft computing strategies are analyzed to minimize the effects of CCF. These methods and strategies are trained on the earlier transactions to predict the newer ones. The ML strategies work well when the distribution of dataset classes is balanced. Several methods like ensemble, data, and algorithmic level strategies are developed to solve when the datasets are not balanced. The reinforcement learning strategy classifies the imbalance distribution, the problem is formulated using linear decision-making, and Q-learning is applied.
Extended author information available on the last page of the article The contribution of this research is to improve cybersecurity by detecting fraudulent transactions in large-scale datasets using the new classifier strategies such as cluster and classifier-based decision tree (CCDT), cluster and classifier-based logistic regression (CCLR), and cluster and classifier-based random forest (CCRF). The proposed classifiers are applied in the detection of fraudulent health care activities. This research performed the preprocessing through the feature extraction, sampling, and transformation stages, and the proposed classifiers are simulated, and the results are analyzed.
This research article is structured as follows. The critiques of the literature survey are discussed in Sect. 2. The proposed classifiers are explained in Sect. 3. The simulation outcomes are analyzed and compared with the other methods in Sect. 4. The conclusions with the future scope are drawn in Sect. 5.
2 Literature survey and critiques CCF significantly affects the financial industry and daily life. Fraud can weaken the public's trust in the institution (Fatima et al. 2021). As a result, we must analyze and distinguish between fraudulent and non-fraudulent transactions. Different strategies are developed in the literature that follows the pattern of all transactions and identifies the fraudulent ones to solve this problem. Techniques such as normalization-based clustering are developed to minimize the clustering attributes. The unsupervised methods are designed to detect fraud. The Bayesian-based sensitive method is developed with cost optimization measures. The computing methods such as artificial intelligence (AI), genetic algorithms (GA), data mining, sequence alignment, and genetic programming are also developed to minimize the risks (Hoang et al. 2020;Sethumadhavan 2018, 2020;Belmonte et al. 2020).
The datasets are balanced using synthetic and sampling methods, and ML, RF, KNN, and DT LR are applied to training. Some additional classifiers are introduced using boosting and neural networks (NN). The most critical issues are only when the data are not balanced. CCF results in unexpected loss for companies and customers; hence, optimal methods are expected to prevent and detect CCFs. The reliable expectations are obtained using kRNNs and Naive Bayes (NB) methods. The regression is applied with ensemble classifiers, nearest neighbors, and sampling methods. The transactions of CCF databases are identified using neuroadaptive, Markov, and stochastic methods. Anomaly detection is also applied for detecting CCFs. The divide and conquer strategy is applied with the entropy measure and hyperparameters to convert the problem into a balanced one. The classifiers' performance is improved using overlapping, and R-value feature selection approaches (Bhaskaran and Marappan 2021;Dang et al. 2021).
The probabilistic RF with autoencoder method utilized the low-dimensional features extraction and applied it for imbalanced datasets (Lin and Jiang 2021). Some categorical attributes with multiple domains as high-cardinality attributes are there in credit card transactions. The domain reduction method is proposed using FFNN to reduce the size of attributes (Carneiro et al. 2022). Sequential fraud detection is achieved using SVM and isolation forest methods (Sharma et al. 2021). The neural network (NN) is applied with the hybrid resampling technique to detect the public datasets' fraud (Esenogho et al. 2022). The hybrid method is developed to identify fraud using XGBoost (Dalal et al. 2022). This method applied different classifiers using ML, but the resources are not centralized, and the constraints are unique. The NN model is developed using LSTM with a linear data model and attention strategies (Benchaji et al. 2021). The evolutionary optimization with support vector data description is developed in the parameters optimization to obtain good accuracy (Mniai and Jebari 2022). The method does not consider the selection features and the integration framework. The fraud losses and FNRs are reduced using the DL algorithms (Alarfaj et al. 2022). ML-and AI-based heuristics and local search strategies are applied to detect CCFs (Jain et al. 2022;Trivedi et al. 2020). These models' accuracy, recall, and precision measures are further improved using new strategies.
The following are the major drawbacks of the existing models: • More significant differences between the negative and positive classes count. • The evolution of fraud characteristics through data shift. • The oversight of linear resources in between the adjacent transactions.
Thus, the design strategies are required to fulfill the following criteria: • CCF activities identification and risk reduction in financial sectors. • Improve the performance of unbalanced classifiers.
• Extraction of the credit card's low-and high-dimensional features to produce a better probabilistic classification. •

Novelty of the proposed model
The proposed model is developed using the following novelty and main contributions: • Hybrid classifier and clustering strategy: This strategy applies the classification for the classifier induction using stochastic centroid clusters to preprocess the data further to obtain better measures. • Hybrid method in CCDT, CCRF and CCLR: The probabilistic hybrid distribution-based hierarchical and density-based clusters are applied in the model for better validation with measures. • Classifier-based sampling strategy: This strategy is applied to classify non-fraud and fraud labels. Probabilistic sampling is applied with descriptive and element classifiers to improve the measures of classification outcomes. • Preprocessing strategy: Applying the initial process using feature extraction, multivariate probabilistic sampling, and probabilistic transformation to all the classifiers.

Architectural components and classifiers
The architecture of the proposed model involves some components-preprocessing through feature extraction and stochastic transformations, splitting and training the datasets in the ratio of 75% training and 25% testing, probabilistic clustered classification, and measures evaluation. The components are required to develop a new model to identify the CCFs using new strategies that play a role in fraud detection since they are frequently used to extract hidden information from the large-scale dataset. The architecture includes examining and preprocessing data sets and applying ML to analyze credit card spending patterns and identify fraudulent transactions. The proposed classifiers target improving cybersecurity by detecting fraudulent transactions in the dataset using new classifier strategies such as CCDT, CCRF, and CCLR. The preprocessing is performed through feature extraction, multivariate probabilistic sampling, and probabilistic transformation stages. The preprocessing operations are applied in all three classifiers at the initial stage. The overall flow diagram of the proposed model is sketched in Fig. 1. The architectural components of the proposed model involve the following: • Preprocessing the datasets.
• Selection of the model. The preprocessing of the dataset involves the following operations: • Import the dataset.
• Search and remove the null values.
• Apply the feature extraction, multivariate sampling, and probabilistic transformation stages. • Update the dataset.
The classifiers are created using the following operations: • Extracting the test set from the historical data.
• Train the test dataset.  The CCLR algorithm for binary classification is defined in Algorithm 1. This algorithm operates on the preprocessed dataset using the supervised strategy. This algorithm returns the probability of a binary variable. The standard logistic curve is shown in Fig. 2, and the LR is sketched in Fig. 3. The algorithm determines the expected clusters and predictors. The probability of a response, p x ð Þ is calculated for all clusters and predictors.
Algorithm 1: CCLR 1: Apply the preprocessing operation through feature extraction, multivariate sampling, and probabilistic transformation stages.
2: Define the number of clusters and predictors.
3: Determine the probability of a response for a given variable using 5: Determine p x ð Þ for all clusters and predictors.
The CCRF algorithm for classification is defined in Algorithm 2. The DT-based RF is an ensemble-based method that includes many DTs, as sketched in Fig. 4. Several outcomes are obtained for every DT in the forest. This algorithm constructs several trees, and the equivalent classes are built as a DT using the posterior probability; p a=x ð Þ. All outcomes are merged at the end to obtain stable and accurate predictions.
Algorithm 2: CCRF 1: Apply the preprocessing operation through feature extraction, multivariate sampling, and probabilistic transformation stages.
2: Define the number of clusters and predictors.
3: Randomly extract the samples from the training subsets. The CCDT algorithm for problem classification is defined in Algorithm 3. The structure of the DT elements is depicted in Fig. 5. The CCDT is constructed using the predictors and clusters. The decision and association rules are applied to optimize the constructed DT. Finally, the classification and knowledge inference rules are optimized.

Simulation and analysis
This section focuses on the datasets, results and analysis, and comparison with other methods of the proposed classifiers.

Datasets
This project applied the dataset of CCF detection from Kaggle.com, which contains two-day credit card transaction details of people from Europe. The dataset contains 31 attributes, including amount, class, and time.
The features of this dataset are as follows: (labels, class-0 & 1), (columns, 31), (missing values, none), (rows, 284,807), (features, 30), (type, object). The significant attributes of the datasets are based on the principal components, numeric variables, amount, time, and class. Due to the card payment and industry data security standards, the original data of credit card users must be masked before being published due to confidentiality. The proposed model is implemented using Python. The simulation parameters are defined in Table 2.

Results and analysis
The proposed model is simulated on the benchmark dataset, and the target attribute is analyzed and sketched in Fig. 6. This diagram consists of the number of genuine and fraudulent transactions in the dataset plotted using the class attribute. From the plot, we can understand that the dataset's fraudulent transaction is much fewer than genuine ones. The performance metrics are evaluated using the measures-true negative (TN), true positive (TP), false negative (FN), and false positive (FP). The proposed strategies are evaluated using the following metrics. The proposed model is simulated, and the experimental results are analyzed for the metrics defined from (4) to (7). The parameters l, location, binary variable, cluster size, and sampling observations of the decision tree affect the performance of the classifiers. The typical expected range for l (0.1, 0.5), x (0.5, 0.8), cluster size (5, 15), and sampling observations (100, 500). The most used measures to

CCDT, CCRF, CCLR matrix analysis
The histogram of the fraud class for the imbalanced dataset is shown in Fig. 6. This diagram depicts the classes on the X-axis and the frequency on the Y-axis. The CCDT confusing matrix is sketched in Fig. 7. The CCRF confusion matrix is sketched in Fig. 8. The outcome of CCLR is sketched in Fig. 9.

Comparison of results
The accuracy, precision, sensitivity, and specificity comparison of the proposed model with other methods are shown in Figs ). The parameters l, location, the binary variable, cluster size, and decision tree sampling observations affect the classifiers' performance. CCRF and CCLR obtain the expected significant results than other existing methods. The typical expected range for the parameters in obtaining the better measures are l (0.1, 0.5), x (0.5, 0.8), cluster size (5, 15), and sampling observations of probabilistic classification (100, 500). When the cluster size exceeds 15, and for large samples, it is necessary to modify the parameters l and x. In the future, recent soft computing strategies will be applied to enhance the performance and to apply the methods on different large-scale datasets to get a more accurate prediction model to overcome CCF detection (Marappan and Sethumadhavan 2021;Alfaiz and Fati 2022;Malik et al. 2022).
Funding The authors have not disclosed any funding.
Data availability Not applicable and not available.

Declarations
Conflict of interest The authors have no conflict of interest in publishing the paper.
Ethical approval This article does not contain any studies with human participants or animals performed by any authors.