The results of the computational experiments obtained by applying the seven steps shown in Figure 3 are presented below.
a) Step A – Selection of the Google Hacking Database: In this step, the Google Hacking Database (GHDB) from Offensive Security was selected because it is an online base. The Dorks had to be copied from the site and exported to a .csv file. The base has a total of 14 categories of Dorks.
For this experiment, the Dorks of the categories “Advisories and Vulnerabilities” and “Files Containing Juicy Info” were selected as a sample. These categories were chosen because they have the largest numbers of Dorks, with 1,996 and 450 Dorks respectively.
b) Step B – Removing Attributes and Hyperlinks: In Excel, the hyperlinks were removed from the Dorks, and the author and date attributes were removed from the base, as these attributes do not influence the Dorks. The base was thus left with 2 attributes: Dork and Category.
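Steps A and B can be sketched in Python instead of Excel. This is a minimal illustration, not the procedure used in the experiment: the column names (Dork, Category, Author, Date) and the three sample rows are assumptions made up for the example, since the layout of the exported .csv file is not given.

```python
import csv
import io

# Hypothetical export of the GHDB. The column names and the rows are
# illustrative assumptions, not the layout of the real file.
SAMPLE_CSV = """Dork,Category,Author,Date
inurl:/phpmyadmin/index.php?db=,Advisories and Vulnerabilities,example,2000-01-01
intitle:index.of passwd,Files Containing Juicy Info,example,2000-01-02
inurl:admin login,Pages Containing Login Portals,example,2000-01-03
"""

# The two categories selected as the sample in step A.
SELECTED = {"Advisories and Vulnerabilities", "Files Containing Juicy Info"}
# The two attributes that remain after step B.
KEPT_ATTRIBUTES = ("Dork", "Category")


def load_base(csv_text):
    """Keep only the selected categories and the Dork/Category attributes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name: row[name] for name in KEPT_ATTRIBUTES}
        for row in reader
        if row["Category"] in SELECTED
    ]


base = load_base(SAMPLE_CSV)
```

Here `base` holds one dictionary per Dork with only the two remaining attributes. The author and date columns are dropped while reading.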
c) Step C – Removing the Site Parameter in Dorks: In this step, site-specific Dorks were made capable of running on any site by removing the Site parameter from the Dorks that had it. To locate these Dorks, Excel was used to search the base for the parameter “Site:”.
After finding the Dorks that contained the “Site:” parameter, these Dorks were modified to remove it. Among them were Dorks specific to Proxy Sites, Google Drive, Github, Mediafire, Dropbox, Sourceforge, and eBay.
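The removal of the Site parameter could also be automated. The sketch below assumes that the parameter and its value (for example, "site:github.com") are removed together and that an optional leading "-" marks an excluded site. Neither detail is stated in the text, and the function name is hypothetical.

```python
import re

# Matches a "site:" operator together with its value, e.g. "site:github.com".
# The optional leading "-" (excluded site) is an assumption of this sketch.
SITE_PARAM = re.compile(r"-?\bsite:\S+", re.IGNORECASE)


def remove_site_parameter(dork):
    """Return the Dork without any site: operator, with whitespace collapsed."""
    without_site = SITE_PARAM.sub(" ", dork)
    return " ".join(without_site.split())


generic = remove_site_parameter("site:github.com inurl:config password")
```

Dorks without the parameter pass through unchanged, apart from the collapsing of repeated spaces.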
d) Step D – Removing Outliers and Stopwords: In this step, the analysis of the Dorks showed that few had more than 100 characters. These Dorks exceeded 100 characters for two main reasons: Composite Dorks and URLs. They were therefore treated as Outliers in this experiment.
Composite Dorks are Dorks that have more than one Dork in their String. URLs are links to specific vulnerabilities on certain websites. Dorks that dealt with URLs were removed, as there would be no way to make them generic and thus automatically run them on other web pages. Composite Dorks were divided into smaller Dorks and then added to the base in their respective categories.
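The outlier rule can be illustrated with a short sketch. The 100-character threshold comes from the text. Detecting a URL by its "http://" or "https://" prefix is an assumption of this example, and the splitting of Composite Dorks is not shown because the splitting rule is not described.

```python
MAX_LENGTH = 100  # Dorks longer than this are treated as outliers


def classify(dork):
    """Label a Dork as 'keep', 'url' (to be removed) or 'composite' (to be split)."""
    if len(dork) <= MAX_LENGTH:
        return "keep"
    # Assumption: a URL outlier starts with an http(s) scheme.
    if dork.lower().startswith(("http://", "https://")):
        return "url"
    return "composite"


label = classify("inurl:/phpmyadmin/index.php?db=")
```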
Then, Stopwords were removed to reduce noise in the base. This was necessary because the GHDB contains some Dorks with special characters whose numerical values, once converted, differ greatly from those of the alphanumeric characters. For this removal, we defined 40 special characters as Stopwords. The removed Stopwords were as follows: ,':;"’!?”()`@~/|*[]^_.+\#%¨¬&©ºª}{£¢§-⌂
e) Step E – Enrichment with Natural Language Processing: In this step, the Dorks in the base were divided into characters by applying NLP tokenization, and each character then became an attribute in the base. This was necessary because the base had only two attributes so far: Dork and Category. Such a low number of attributes makes it impossible to apply ML techniques to this base.
To enrich the base, that is, to add new attributes, an algorithm was developed in Python to find the Dork with the greatest number of characters and create the same number of attributes in the Dorks base. Each Dork can then be divided into characters, one per attribute. This action not only enriches the base but also prevents a problem in the next step of the experiment (step F): if a whole Dork were converted to its numeric ASCII value, the resulting number would be too long to allow the application of ML techniques.
For example, a 10-character Dork, when converted as a whole, becomes a numeric value of up to 30 digits, because each character converted to ASCII yields a value of up to 3 digits. On the other hand, if each base attribute holds a single character, each attribute receives a numeric value of at most 3 digits, enabling the application of intelligent techniques to the base.
Thus, 94 attributes were created in the base, named Carac01, Carac02, Carac03 to Carac94. The database now has a total of 95 attributes: the 94 “Carac” attributes plus the Category attribute with numerical values defined in step C. The Dork division was performed with the “nltk.word_tokenize()” function.
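The enrichment can be sketched as follows. This example swaps in plain Python `list()` for the character split instead of the NLTK function named above, so it illustrates the idea and does not reproduce the original algorithm. The function names are hypothetical. Unused attributes are filled with 0, as described for the example Dork later in the text.

```python
def attribute_names(width):
    """Carac01, Carac02, ... up to the requested width."""
    return [f"Carac{i:02d}" for i in range(1, width + 1)]


def enrich(dorks):
    """Give every Dork one attribute per character, padded with 0."""
    # The longest Dork determines how many attributes are created.
    width = max((len(dork) for dork in dorks), default=0)
    names = attribute_names(width)
    rows = []
    for dork in dorks:
        chars = list(dork) + [0] * (width - len(dork))
        rows.append(dict(zip(names, chars)))
    return rows


rows = enrich(["inurlphpmyadminindexphpdb", "abc"])
```

With the full base, the longest Dork would produce the 94 attributes mentioned above.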
f) Step F – Base Transformation: After applying NLP in step E, the Dork characters were converted to their respective numeric ASCII values.
This conversion follows the study by Guo et al. (2018), which converts characters to their numeric ASCII values to detect Memory Overflow vulnerabilities. The conversion was performed with the “ord()” function of the Python language, the same function used in Guo's study. For example, consider the Dork:
inurl:/phpmyadmin/index.php?db=
In step D, this Dork was processed along with the other Dorks in the base, and its special characters were removed. The Dork thus became:
inurlphpmyadminindexphpdb.
Then, in step E, the Dorks were divided into characters. This Dork has 25 characters, and each character was assigned to an attribute: the first character, “i”, was assigned to the attribute Carac01; the second character, “n”, was assigned to the attribute Carac02; and so on until the end of the Dork. The remaining attributes received a value of 0 so that the base would not contain null values.
In step F, the characters of this Dork were converted to their numeric ASCII values. The characters of this Dork now have the following values:
105 110 117 114 108 112 104 112 109 121 97 100 109 105 110 105 110 100 101 120 112 104 112 100 98
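The conversion of the example Dork can be reproduced with `ord()`. The helper name `to_ascii_row` and the padding to 94 attributes within the same function are choices made for this sketch.

```python
def to_ascii_row(dork, width):
    """Convert each character with ord() and pad the row with 0 up to width."""
    codes = [ord(ch) for ch in dork]
    return codes + [0] * (width - len(codes))


row = to_ascii_row("inurlphpmyadminindexphpdb", 94)
print(row[:25])
```

The first 25 values of `row` match the sequence listed above, and the remaining 69 attributes hold 0.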
g) Step G – SOM Application: After enriching and transforming the Dorks base, SOM was applied to validate the enrichment and conversion performed on the base, to confirm that ML techniques can be applied to it, and to find vulnerabilities in the generated clusters. In this way, we sought to extract knowledge from the Dorks base. The SOM was configured with a map of 225 neurons, that is, a 15x15 map, and a hexagonal topological neighborhood. The parameters used in the training phase were a number of epochs (iterations) equal to 3000 and a learning rate equal to 0.5 [3].
For this experiment, all Dorks from the categories “Advisories and Vulnerabilities” and “Files Containing Juicy Info” were selected as a sample, as these two categories have the highest numbers of Dorks. The “Advisories and Vulnerabilities” category has a total of 1,996 Dorks, which search for web pages with unprotected files. The application of SOM to this category generated a map with 3 groups. This map is shown in Figure 4.
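To make the configuration concrete, the sketch below trains a small SOM in pure Python. It is an illustration under several assumptions that are not stated in the text: one iteration presents one randomly drawn sample, the learning rate and neighborhood radius decay linearly, the neighborhood function is Gaussian, weights start at random values, and inputs are scaled to the range 0 to 1. The hexagonal lattice is modeled by shifting odd rows by half a cell. The software used in the experiment is not identified, so this is not its implementation.

```python
import math
import random


def hex_positions(rows, cols):
    """Neuron coordinates on a hexagonal lattice (odd rows shifted by half a cell)."""
    return [
        (c + 0.5 * (r % 2), r * math.sqrt(3) / 2)
        for r in range(rows)
        for c in range(cols)
    ]


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def train_som(data, rows=15, cols=15, epochs=3000, learning_rate=0.5, seed=0):
    """Train a SOM; defaults follow the 15x15 map, 3000 iterations and rate 0.5."""
    rng = random.Random(seed)
    dim = len(data[0])
    positions = hex_positions(rows, cols)
    weights = [[rng.random() for _ in range(dim)] for _ in positions]
    sigma0 = max(rows, cols) / 2  # assumed initial neighborhood radius
    for t in range(epochs):
        decay = 1 - t / epochs  # assumed linear decay
        rate = learning_rate * decay
        sigma = max(sigma0 * decay, 0.5)
        x = rng.choice(data)  # assumed: one random sample per iteration
        bx, by = positions[best_matching_unit(weights, x)]
        for w, (px, py) in zip(weights, positions):
            dist2 = (px - bx) ** 2 + (py - by) ** 2
            h = rate * math.exp(-dist2 / (2 * sigma ** 2))  # Gaussian neighborhood
            for k in range(dim):
                w[k] += h * (x[k] - w[k])
    return weights


# Tiny demonstration on toy data; the real base has 94 attributes per Dork.
toy = [[0.0, 0.0], [1.0, 1.0]]
weights = train_som(toy, rows=3, cols=3, epochs=200)
```

On the full base, each Dork row would first be scaled, for example by dividing the ASCII values by 127. A pure-Python loop of this kind is slow at that size, so an optimized SOM library would be the practical choice.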
The application of the SOM network generated three groups in the Advisories and Vulnerabilities base. Table 10 shows the characteristics of each one of them.
Table 10 Map Characteristics in the Advisories and Vulnerabilities Category

| Cluster | Color | Dorks | Total (%) | Vulnerabilities |
|---|---|---|---|---|
| C1 | Blue | 1092 | 54.71% | Online devices |
| C2 | Red | 622 | 31.16% | URL requests |
| C3 | Yellow | 282 | 14.13% | Multiple URL requests |
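The percentages in Table 10 follow directly from the cluster sizes, as this short check shows. The variable names are chosen for the example.

```python
# Number of Dorks per cluster, taken from Table 10.
cluster_sizes = {"C1": 1092, "C2": 622, "C3": 282}

total = sum(cluster_sizes.values())
shares = {
    name: round(100 * count / total, 2)
    for name, count in cluster_sizes.items()
}
print(total, shares)
```

The total equals the 1,996 Dorks of the category, and the rounded shares are 54.71, 31.16 and 14.13 percent.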