Two different ML model types were trained and validated on the collected dataset: an artificial neural network (ANN) and an ensemble classifier consisting of decision trees. Both classifiers were trained using 5-fold cross-validation to better utilize the collected data, reduce overfitting, and improve the generalization of the model predictions [57].
Given the effectiveness of deep learning models, the suggested approach was tested on an ANN [58] with multiple hidden layers. Neural networks with different hidden layer architectures were trained using 5-fold cross-validation. All networks achieved high accuracy (above 97.8%) and low FPRs (0.2%) but correctly identified only a small fraction of the malicious websites (2.5–4%). In addition, an analysis of the receiver operating characteristic (ROC) curves yielded by the different network architectures indicates that the trained models were affected by overfitting.
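For concreteness, the cross-validation procedure can be sketched as follows; this is a minimal illustration assuming a scikit-learn-style feature matrix `X` and binary label vector `y` (1 = malicious), and the hidden-layer sizes shown are placeholders rather than the architectures actually evaluated.

```python
# Minimal sketch of stratified 5-fold cross-validation for an ANN classifier.
# X, y, and the hidden-layer sizes are illustrative assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, recall_score

def cross_validate_ann(X, y, hidden_layers=(64, 32), folds=5, seed=42):
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    accuracies, tprs = [], []
    for train_idx, test_idx in skf.split(X, y):
        clf = MLPClassifier(hidden_layer_sizes=hidden_layers,
                            max_iter=500, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        predictions = clf.predict(X[test_idx])
        accuracies.append(accuracy_score(y[test_idx], predictions))
        tprs.append(recall_score(y[test_idx], predictions))  # TPR on the malicious class
    return float(np.mean(accuracies)), float(np.mean(tprs))
```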
This is not surprising, as the effect of class imbalance on neural network classification performance has previously been shown to be detrimental [59, 60]. Two main approaches have previously been suggested for handling imbalanced classification problems with ANN models, both of which significantly increase the weight of the minority class: supervised oversampling [61] and synthesized data augmentation [62–64]. These approaches share one main disadvantage: active manipulation of the original dataset. Such manipulation contradicts the intention of accurately representing a commercial, real-life scenario. In addition, it has previously been suggested that data augmentation techniques do not learn the target distribution [65].
As a proof of concept regarding ANN effectiveness in this problem space, and to neutralize the effect of class imbalance without adding synthesized data, a more balanced dataset was examined.
This dataset was a subset of the original dataset consisting of 2,788 samples: all 697 original malicious samples and 2,091 randomly selected legitimate samples. The prior probability of a website being malicious in this dataset was approximately 13 times greater than that in the previous experiment (25% vs. 1.95%).
To adapt the ANN model to a smaller dataset and prevent overfitting, the feature space and the network architecture needed to be reduced. Accordingly, the trained ANN consisted of 2 hidden layers, and principal component analysis (PCA) was used for dimensionality reduction, resulting in a feature space consisting of 50 components. As expected, the ANN model performed significantly better on the more balanced dataset and yielded better classification results (recall: 0.786; precision: 0.734; accuracy: 0.876). The balanced model's accuracy was lower than that of the imbalanced model, a difference that is explained by the significant gap between their minority class prior probabilities.
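This balanced-subset setup can be illustrated with the following sketch; the majority-class sampling ratio mirrors the 697:2,091 proportion reported above, while the scaler and hidden-layer widths are assumptions rather than the exact configuration used.

```python
# Sketch of the balanced-subset experiment: undersample the legitimate class,
# reduce the feature space to 50 principal components, and train a 2-hidden-layer ANN.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def build_balanced_subset(X, y, legit_per_malicious=3, seed=42):
    """Keep all malicious samples (y == 1) and a random sample of legitimate ones."""
    rng = np.random.default_rng(seed)
    malicious_idx = np.where(y == 1)[0]
    legitimate_idx = np.where(y == 0)[0]
    kept_legit = rng.choice(legitimate_idx,
                            size=legit_per_malicious * len(malicious_idx),
                            replace=False)
    subset_idx = np.concatenate([malicious_idx, kept_legit])
    return X[subset_idx], y[subset_idx]

balanced_ann = make_pipeline(
    StandardScaler(),                 # scaling assumed; not stated in the text
    PCA(n_components=50),             # 50 components, as described above
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42),
)
```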
Another effort was made to address the crucial effect of class imbalance: a sequential NN was trained with oversampling, and the weight of the minority class (malicious websites) was increased. As presented in Table 2, this model detected 75% of the malicious websites and achieved higher accuracy (94%) and a lower FPR (0.06) than the balanced ANN model (FPR of 0.09). However, this FPR level is similar to those of previously reported methods and is not sufficient for a real-life scenario.
Table 2
Performance comparison among the neural network models.
| Model | Dataset characteristics | Minority class prior probability | Accuracy | TPR | FPR |
| ANN | Imbalanced | 1.95% | 0.98 | 0.04 | 0.002 |
| ANN | Balanced | 25% | 0.87 | 0.77 | 0.09 |
| RNN | Imbalanced with oversampling | 1.95% | 0.94 | 0.75 | 0.06 |
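The oversampling and class-weighting strategy behind the last row of Table 2 can be sketched with a Keras Sequential network as follows; the oversampling factor, layer sizes, and minority-class weight are illustrative assumptions, not the values used in the study.

```python
# Sketch of training a Sequential network on an oversampled dataset with a
# heavier loss weight on the minority (malicious) class.
import numpy as np
import tensorflow as tf

def oversample_minority(X, y, factor=5, seed=42):
    """Replicate minority-class rows so they appear `factor` times as often."""
    rng = np.random.default_rng(seed)
    malicious_idx = np.where(y == 1)[0]
    extra = rng.choice(malicious_idx, size=(factor - 1) * len(malicious_idx),
                       replace=True)
    order = rng.permutation(np.concatenate([np.arange(len(y)), extra]))
    return X[order], y[order]

def train_weighted_sequential(X, y, minority_weight=25.0):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X.shape[1],)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.Recall()])
    # A higher weight for class 1 penalizes missed malicious websites more strongly.
    model.fit(X, y, epochs=20, batch_size=128,
              class_weight={0: 1.0, 1: minority_weight}, verbose=0)
    return model
```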
Ensemble models achieve high accuracy by combining a number of base estimators and can increase the reliability of machine learning relative to a single estimator [66].
Bagging (bootstrap aggregation) is a commonly used ensemble classification method that reduces the variance of a decision tree and addresses classification noise [67]. In situations with substantial classification noise, bagging has been found to be superior to boosting and randomization [68]. The algorithm randomly creates several subsets of the given training set by sampling with replacement. Each subset is used to train a different decision tree, and the individual predictions are aggregated into an averaged final prediction. Unlike the random forest classifier, the bagging classifier does not restrict each tree to a random subset of the features and, as a result, can leverage the most significant features in all of its weak classifiers.
The chosen bagging classifier consisted of 10 base estimators. Each estimator was a classification and regression tree (CART) with a maximal depth of 4, adapted to the extremely imbalanced dataset by enforcing a high penalty for classification mistakes on the minority class. The CART algorithm was selected because it identifies splitting variables by searching through all possibilities among the input variables and because its results can be leveraged for explainability purposes. The model successfully detected 75% of all malicious websites while maintaining a low FPR of 2% and achieving an overall accuracy of 97.5%, as described in Table 3. Allowing a higher FPR resulted in a higher TPR while maintaining a high accuracy level, as shown in Fig. 3.
Table 3
Performance metrics for the bagging classifier and the convergence experiment.
| | Recall | Specificity | Precision | NPV | FPR | FNR | Accuracy |
| Bagging Classifier | 0.75 | 0.98 | 0.41 | 0.99 | 0.02 | 0.25 | 0.97 |
| Convergence Experiment | 0.83 | 0.97 | 0.44 | 0.99 | 0.02 | 0.17 | 0.97 |
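The bagging configuration described above maps directly onto scikit-learn, as in the following sketch; the 10 depth-4 CART base estimators follow the text, whereas the specific class-weight penalty is an assumption.

```python
# Sketch of the bagging classifier: 10 CART base estimators of maximal depth 4,
# each with a heavy (assumed) penalty on minority-class mistakes.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base_tree = DecisionTreeClassifier(max_depth=4,
                                   class_weight={0: 1, 1: 50},  # assumed penalty
                                   random_state=42)

bagging_model = BaggingClassifier(
    estimator=base_tree,      # named `base_estimator` in scikit-learn < 1.2
    n_estimators=10,          # 10 base estimators, as described above
    bootstrap=True,           # sampling with replacement over the training set
    random_state=42,
)
# bagging_model.fit(X_train, y_train)
# scores = bagging_model.predict_proba(X_test)[:, 1]  # threshold to trade FPR for TPR
```

Because the bagging classifier draws bootstrap samples over the training instances but keeps the full feature set for every tree, each weak classifier can use the most significant features, in line with the distinction from random forests noted above.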
To provide explainability for the results of both ML models and to understand the impact of each feature, an implementation of Shapley values [69] for explainable AI was employed. Shapley values interpret the impact of a feature taking a certain value in comparison to the prediction that would have been made if that feature took some baseline value [70]. This implementation also calculates the aggregated contributions in a way that provides insights at the model level.
When analyzing a specific prediction, one can learn which features contributed most and what their actual values were, as shown in Fig. 4. In this specific example, the model prediction was ‘1’ (‘1’ represents a malicious website). The biggest impact came from the 3rd-party usage statistics, the high number of images, and the relatively high number of HTML elements compared to the baseline values.
[Fig. 4 about here.]
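The Shapley-value analysis can be reproduced along the following lines with the `shap` package; the variable names (`bagging_model`, `X_train`, `X_test`, `feature_names`) are assumptions carried over from the sketches above, and the model-agnostic KernelExplainer is used here as a stand-in for whichever explainer implementation was actually employed.

```python
# Sketch of local and model-level Shapley-value explanations using shap.
import shap

# A small background sample keeps the model-agnostic KernelExplainer tractable.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(
    lambda data: bagging_model.predict_proba(data)[:, 1], background)

# Local explanation of a single prediction (the kind of view shown in Fig. 4).
local_values = explainer.shap_values(X_test[:1])

# Aggregating contributions over many predictions yields the model-level view.
global_values = explainer.shap_values(X_test[:200])
shap.summary_plot(global_values, X_test[:200], feature_names=feature_names)
```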
A deeper dive into the predictions classified as false positives (FPs) revealed that, in many cases, malicious websites had not been identified by GSB. Of the predictions considered FPs, 100 websites were randomly selected and manually reviewed; 18% of these websites were identified as malicious, while an additional 21% were identified as suspicious. These websites offered a variety of content that might raise suspicion, ranging from software downloads, through prescription medicine, to engagement with human beings, as demonstrated in Fig. 5.
Accordingly, we conducted a convergence experiment in which the model predictions were reviewed and relabeled. Each prediction classified as an FP was reviewed by examining the explainable AI results and by manually accessing the source code and a screenshot of the corresponding website. The allegedly malicious websites were then relabeled, and the model was retrained. The observation that roughly 18% of the FP predictions were actually malicious websites persisted across the iterations. However, every iteration discovered additional malicious websites that had been classified by GSB as legitimate, and as a result, the experiment did not converge after 5 iterations.
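Schematically, the relabel-and-retrain loop can be expressed as follows; the `manual_review` callback is a placeholder for the human inspection of the explainable AI outputs, source code, and screenshots described above.

```python
# Sketch of the convergence experiment: retrain after relabeling FPs that
# manual review confirms to be malicious, and stop when no FPs are confirmed.
from sklearn.base import clone

def convergence_experiment(model, X, y, manual_review, iterations=5):
    labels = y.copy()
    fitted = None
    for _ in range(iterations):
        fitted = clone(model).fit(X, labels)
        predictions = fitted.predict(X)
        fp_idx = [i for i, (p, t) in enumerate(zip(predictions, labels))
                  if p == 1 and t == 0]
        confirmed = [i for i in fp_idx if manual_review(i)]  # human-in-the-loop check
        if not confirmed:            # converged: no FP was actually malicious
            break
        for i in confirmed:
            labels[i] = 1            # relabel websites that GSB missed as malicious
    return fitted, labels
```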
Taking this into consideration, it is reasonable to claim that the model's performance was actually higher than indicated by the above performance metrics. The results of the first iteration of the convergence experiment, presented in Table 3, imply that the model's performance would be higher in a real-life scenario and that 83% of the malicious websites can be detected by the proposed approach at the same FPR and accuracy levels.