Google Hacking Database Attributes Enrichment and Conversion to Enable the Application of Machine Learning Techniques

doi:10.21203/rs.3.rs-1995597/v1

In the initial phase of a pentest, named Open Source Intelligence, passive reconnaissance can be performed with Google Hacking. Google Hacking is a practice that uses search strings called Dorks. To support it, the Google Hacking Database is available with thousands of Dorks. However, the Google Hacking Database contains a reduced number of attributes, all with textual values, which prevents the direct application of Machine Learning techniques. One way to enrich the Google Hacking Database with attributes is to use Natural Language Processing and to transform textual values into numeric ones by converting Dork characters to ASCII. So, the objective was to apply Natural Language Processing to enrich the Google Hacking Database with attributes and convert its textual values to ASCII, to enable the application of Machine Learning techniques. The computational experiments were conducted in seven steps: Selection of the GHDB Base, Removal of Hyperlinks and Deletion of Attributes, Removal of the Site Parameter from Dorks, Removal of Outliers and Stopwords, Enrichment with Natural Language Processing, Base Transformation, and Application of the SOM. The results obtained with the application of the SOM were considered good, based on the values of the metrics that evaluated the network. Thus, it is considered that the objective of this paper was achieved.

Dorks

Google Hacking Database

Kohonen Self-organizing Map

Machine Learning

Natural Language Processing

One method that can be used to ensure information security is to discover the vulnerabilities where the information is stored. Vulnerabilities represent security flaws that pose risks to information [1]. A practice used to find vulnerabilities in web pages is Google Hacking (GH). GH works like a Google search that uses a search string called a Dork, a set of characters used to perform a specific search on Google [2].

Dorks are used for different purposes, such as finding vulnerabilities in the structure of a website, exposed database files, active service logs, and virus-infected files. To assist the practice of GH, the Google Hacking Database (GHDB) is available on the internet, a base with Dorks evaluated and validated by Offensive Security. Despite the number of Dorks, the GHDB contains few attributes, requiring that those who use it have prior knowledge. Furthermore, these few attributes are textual values, not numeric, which limits the use of Machine Learning (ML) techniques, such as Artificial Neural Networks (ANNs), on the base. It is worth highlighting the importance of applying ML techniques to help prevent the growing number of highly complex attacks.

One type of ANN architecture that can be used in the Information Security (IS) area is the Kohonen Self-organizing Map (SOM). According to Kohonen [3], the SOM network is an ANN capable of extracting knowledge from a database, considering all its attributes simultaneously and forming clusters by similarity [4].

This capability allows the SOM network to be applied in the IS area for various purposes, such as investigating digital evidence on computers and detecting anomalies in online environments. It is noteworthy that ML techniques have been used in Open-Source Intelligence practices (OSINT), such as the GH whose objective is to collect information from open sources [5].

So that ML techniques can be applied in GHDB it is necessary to enrich the base with attributes, to provide more information for the techniques to conduct their learning. Furthermore, it is also necessary to transform attributes with textual values into numeric values since artificial neural networks are mathematical models.

As for the GHDB enrichment, the Dorks can be divided into characters by applying NLP tokenization. Enrichment is the process responsible for adding information to a database, making it suitable for performing a certain task. When new information is added to a database, new facts are added to existing data, thus enabling new approaches to discover knowledge [6]. As for the attributes with textual values of the GHDB, one way to transform them into numeric values is to perform a character conversion to ASCII [7]. So, the aim of this paper was to apply NLP to enrich the GHDB with attributes and convert its textual values to ASCII, to enable the application of ML techniques to group Dorks by similarity and find vulnerabilities. The contributions of this paper are the description of how to apply NLP to enrich the GHDB, how to transform attributes with textual values (the Dorks) into numerical values, and how to apply an ANN, in this case the SOM, to group Dorks by similarity, enabling the application of an ML technique to such an important base.

2.1 Google Hacking

Because the source code of web pages is open and accessible over the internet, it is possible to determine their version and structure just by searching for “strings,” that is, specific character sets, in search engines [8]. Search engines are information retrieval tools in which users enter keywords for queries and subsequently get results automatically. Examples of such search engines are Google and Yahoo [9].

Researchers from different areas have been studying search engines in different approaches, such as in the search for products, scientific publications, and in the areas of social marketing, economics, politics, and IS [10].

Roy et al. [11] present a practice for pentesting that uses the Google search engine to find vulnerabilities in internet pages using only specific strings, that is, a certain set of characters that may or may not include advanced Google operators. The string used in Google to search for vulnerabilities is called a Dork, while the pentest practice that uses Google and Dorks is called Google Hacking (GH) or Google Dorking. Table 1 presents the five main items about the practice of GH.

Table 1

Five main items for the practice of Google Hacking
ID	Item	Description
01	Cache System Usage	The GH practice uses Google's caching system to go directly to a snapshot of a web page. That is, it is possible to extract information from that page without entering the domain, thus managing to consult the pages without establishing any direct connection with the destination.
02	Dorks	Dorks are strings, which can be composed by Google search operators. Dorks are used in GH's practice to find vulnerabilities or information that help reveal vulnerabilities in web pages.
03	Discovery of Network Resources	By combining search operators, the GH practice can obtain lists of addresses from servers and services available in each domain, in addition to being able to find pages that are connected to a given URL.
04	File Collection	The GH practice not only discovers vulnerabilities in the structure of a web page, but it can also discover files that are open to the public, such as password logs (explicit, hash, encrypted, etc.), logins, databases, among others.
05	Google Hacking Database	Google Hacking Database (GHDB) is a database with thousands of Dorks evaluated and validated by Offensive Security.

Mazurczyk and Caviglione [12], and Kalech [13] address types of information that can be found with GH. Information can be server names, open directories, file copies, IP address ranges, critical information about SCADA systems, online services, and devices such as cameras and printers. To understand the impact that the practice of GH can have, Rahman et al. [14] mention the practice of GH and the vulnerabilities present in web applications. The authors discuss how easy it is to find a vulnerability or sensitive information, such as an IP address or an email address, during a search with a certain Dork available on the GHDB.

2.2 Dorks

According to Toffalini et al. [2], Dorks are "strings" that can be composed of specific words and/or parameters developed for search engines to collect information about vulnerabilities or information that helps the search for vulnerabilities. In the literature, various Dorks are described for different purposes, for example, to find vulnerable websites, confidential information, or exposed files.

Pan et al. [15] and Quintet, Leonhardt, and Holz [16] describe categories that can be used to classify the words and parameters that make up the Dorks. Table 2 presents the categories, along with their descriptions and examples.

Table 2

Categories to classify the components of a Dork
Category	Description	Example	Operation
GRAM	Google Advanced Operators	Intext, Filetype, Intitle, Inurl, Site	Directs the search to the structure of the site
WEB	Vulnerabilities in Web Technologies	Phpmyadmin, Wordpress	Directs the search to web technologies
SCRIPT	Web Page Extensions	Php, AspX, Asp, Jsp	Directs the search to specific types of web pages
DOC	Unprotected files	Doc, Pdf, Docx, Xls, Xlsx	Directs the search to specific types of files indexed on web pages
BD	Database Files	Sql, Mdb, Myd	Directs the search to specific types of databases indexed on web pages

The following Dork is an example based on Table 2: Inurl:".gov.br" Intext:"Senhas.xlsx" OR "logins.doc"

Inurl and Intext belong to the GRAM category, as they are advanced Google operators. They direct the search to the structure of a particular site. The Inurl parameter searches for sites that contain “.gov.br” in their URL, while the Intext parameter searches for sites that contain DOC category files named “Senhas.xlsx” or “logins.doc” in their content. Thus, this Dork can be used to search for “Senhas.xlsx” (“senhas” is Portuguese for “passwords”) or “logins.doc” files on websites that contain “.gov.br” in their URL.
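The classification described in Table 2 can be illustrated with a minimal Python sketch. The vocabularies below contain only the examples listed in Table 2, and the function name `classify_dork` and the regular-expression tokenizer are assumptions made for illustration, not the procedure used by the cited authors.

```python
import re

# Category vocabularies taken from the examples in Table 2 (not exhaustive).
CATEGORIES = {
    "GRAM": {"intext", "filetype", "intitle", "inurl", "site"},
    "WEB": {"phpmyadmin", "wordpress"},
    "SCRIPT": {"php", "aspx", "asp", "jsp"},
    "DOC": {"doc", "pdf", "docx", "xls", "xlsx"},
    "BD": {"sql", "mdb", "myd"},
}


def classify_dork(dork):
    """Return (token, category) pairs for the tokens that appear in Table 2."""
    # Hypothetical tokenizer: lowercase alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", dork.lower())
    found = []
    for token in tokens:
        for category, vocabulary in CATEGORIES.items():
            if token in vocabulary:
                found.append((token, category))
    return found


example = 'inurl:".gov.br" intext:"Senhas.xlsx" OR "logins.doc"'
labels = classify_dork(example)
```

For the example Dork, the two operators are labeled GRAM and the two file extensions are labeled DOC; tokens outside the Table 2 vocabularies are ignored.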

According to Zhang, Notani, and Gu [17], and Mider, Garlicki, and Jan [18], the highest concentration of validated and documented Dorks in the world is available in the GHDB. It is the largest and most representative online database of Dorks in the world. A disadvantage of the base is that it has few attributes: only the text of the Dork, the author who published it in the base and category to which the Dork belongs. The Dorks available on the GHDB are classified into 14 categories, based on their functionality, that is, on the type of vulnerability they seek. The categories are shown in Table 3.

Table 3

The fourteen categories of GHDB Dorks
Dork Category	Description
Footholds	Pages that display some trace of vulnerabilities
Files Containing Usernames	Pages containing files with usernames or logins
Sensitive Directories	Pages that have sensitive unprotected directories
Web Server Detection	Pages that have unprotected information about their web server
Vulnerable Files	Pages containing unprotected files
Vulnerable Servers	Pages containing unprotected servers
Error Messages	Pages that reveal vulnerabilities through error messages
Files Containing Juicy Info	Pages containing unprotected settings files
Files Containing Passwords	Pages containing files with unprotected passwords
Sensitive Online Shopping Info	E-commerce pages that display unprotected information
Network or Vulnerability Data	Pages that display vulnerable data about the structure of a network
Pages Containing Login Portals	Pages containing vulnerable login portals
Various Online Devices	Pages containing unprotected online devices
Advisories and Vulnerabilities	Pages that contain vulnerabilities coming from advertisements

The use of Dorks contained in the GHDB in the GH practice allows finding vulnerabilities in web pages already in the initial step of a Pentest, called Reconnaissance or OSINT [18].

2.3 Natural Language Processing

With the significant growth of user-generated content on the Internet, the automatic extraction of relevant information started to receive interest from researchers from different areas. Many of these researchers extract this online information through Natural Language Processing [19]. Natural Language Processing (NLP) is the subarea of AI responsible for making computers able to interpret and produce content in human language. As it is an interdisciplinary area, it draws on other areas such as Computer Science, Linguistics, Psychology, and Statistics [20].

The application of NLP to texts or other human language source content can be performed through several tasks. Among the main tasks, the following stand out: stemming, corpus production, tokenization, lemmatization, grammatical marking, syntactic analysis, and the removal of stopwords [21–22]. The main tasks of NLP are described in Table 4.

Table 4

Main NLP Tasks
Task	Description
Stemming	Used to consolidate different variations of a word that share the same stem into a common root form. For example, the words "Like" and "Likes" are both simplified to the root form "Lik".
Corpus	The set of all the words present in a text, gathered in a single item. Also called “Text Base,” it is used in most NLP tasks.
Tokenization	Tokenization, also called "Word Segmentation", is responsible for breaking up a sequence of characters in a text, that is, it determines where the words of a text start and end and transforms them into tokens. Tokens are the items of the list generated from a tokenized corpus.
Lemmatization	The reduction of surface word forms to their canonical form, which is called a lemma. The lemma relates different forms of a word with the same meaning. For example, the word “Best” has the word “Good” as its lemma. Its use is efficient for information retrieval.
Grammatical Marking (POS)	This is a basic task in linguistics applied to the corpus. The goal is to assign morphosyntactic characteristics to each word in a sentence according to its context. It can also be applied to sentences and paragraphs.
Syntax analysis	The natural successor to grammar markup, parsing provides a dependency tree as the output of each word within a corpus. Its objective is to provide for each sentence or clause, an abstract representation of the grammatical entities and their relationships.
Removal of Stopwords	The removal of stopwords is intended to keep a more concise and cleaner corpus for future analysis. An example of application is the removal of words such as “of,” “if,” “are,” and “is.”
Frequency	Used to produce a list of words and their frequency in each corpus. In addition, it is possible to produce word-frequency lists using a corpus marked with grammatical markup.
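Three of the tasks in Table 4 (tokenization, removal of stopwords, and frequency) can be sketched with the Python standard library alone. This is only an illustration: the stopword list, the sample sentence, and the function names are assumptions, and a library such as NLTK would normally supply more complete tokenizers and stopword lists.

```python
import re
from collections import Counter

# Illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"of", "if", "are", "is", "the", "a", "to"}


def tokenize(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def frequency(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)


corpus = "The index of the admin directory is the index of passwords"
tokens = tokenize(corpus)
content_tokens = remove_stopwords(tokens)
counts = frequency(content_tokens)
```

Here the 11 tokens of the sample sentence are reduced to 5 content tokens, and "index" is the only token counted twice.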

Some libraries and tools are used to implement and develop NLP algorithms, in addition to the main AI techniques used for NLP [19]. Table 5 describes the main libraries and tools used for NLP.

Table 5

Main libraries and tools for NLP
Library	URL	Description
NLTK	http://www.nltk.org	Open-source library used to perform tokenization, opinion mining, sentiment analysis, and semantic reasoning tasks.
OpenNLP	https://opennlp.apache.org/	Library used for text processing. It supports tasks like sentence segmentation, entity recognition, and sentiment analysis.
CoreNLP	http://stanfordnlp.github.io/CoreNLP/	The library is capable of advanced sentiment analysis.
Gensim	http://radimrehurek.com/gensim/	Open-source library used to model topics with Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
FudanNLP	https://code.google.com/archive/p/fudannlp/	Open-source toolkit for Chinese NLP. It supports tasks such as word segmentation, POS tagging, named entity recognition, and dependency analysis.
LTP	http://www.ltp-cloud.com/intro/en/	Open-source system for the Chinese language, including lexical analysis, parsing, and semantic analysis.
NiuParser	http://www.niuparser.com	Syntactic and Semantic Analysis Toolkit for the Chinese Language. It supports tokenization, POS tagging, dependency analysis, and semantic function labeling.

As for the application of NLP in the IS area, studies show a trend toward its application in Pentest, in the initial step called Reconnaissance or OSINT. The justification is that NLP increases the effectiveness of discovering already published and documented vulnerabilities, such as outdated software versions and online device configuration files [23].

2.4 Artificial Neural Networks

Within Artificial Intelligence (AI) there are sub-areas such as Natural Language Processing (NLP), Computer Vision (CV), and Machine Learning (ML). The ML field is concerned with the issue of how to build computer programs that automatically improve with experience [24–25]. One ML technique that can be used to solve problems in the IS area is Artificial Neural Networks (ANNs). ANNs can be used for several tasks, such as classification, grouping, association, pattern recognition, regression, and prediction [26].

ANNs are mathematical models of artificial intelligence inspired by the structure of the brain to simulate human behavior in processes such as learning, association, generalization, and abstraction. An ANN can learn and improve its performance based on the environment in which it finds itself. ANNs are very effective in solving nonlinear problems and performing parallel processing. In addition, they can simulate complex systems, an ability that traditional computational techniques lack [27–28].

An important feature of ANNs is the ability to learn from incomplete and noisy data. Fault tolerance is part of the architecture due to the distributed nature of the processing. If a neuron fails, its incorrect output is compensated by the correct outputs of the others [28]. In ANNs, learning occurs through a set of simple processing units called artificial neurons. The representation of the basic elements of an artificial neuron is shown in Fig. 1. It shows the neuron's inputs (x1, x2, ..., xn), their respective weights (w1j, ..., wnj), the summing junction represented by the letter sigma (Σ), the activation function (φ), and finally the output (y).

The activation function of the artificial neuron is performed similarly to the synapse on the biological neuron, transmitting or blocking nerve impulses. In this way, the learning of ANNs happens through weight adjustments. The weight value will be determined based on its value in the previous iteration, as shown in Eq. (1):

$${w}_{i}^{t+1}={w}_{i}^{t}+\varDelta {w}_{i}^{t} \tag{1}$$

Updating the weights depends on the algorithm, but it is based on minimizing the error between the values predicted by the network and the desired outputs, as shown in Eq. (2):

$${\epsilon }_{i}=\sum {w}_{i}{x}_{i}-{y}_{i} \tag{2}$$
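Eqs. (1) and (2) can be illustrated with a short Python sketch of a single linear neuron. The text does not state how the increment Δw is computed, so the delta rule used below (Δw_i = −η · ε · x_i), the learning rate η, and the sample values are assumptions chosen only to show the error shrinking over iterations.

```python
def neuron_error(weights, inputs, target):
    """Eq. (2): weighted sum of the inputs minus the desired output."""
    return sum(w * x for w, x in zip(weights, inputs)) - target


def update_weights(weights, inputs, target, learning_rate=0.1):
    """Eq. (1), with the delta rule as an assumed choice for the increment."""
    error = neuron_error(weights, inputs, target)
    deltas = [-learning_rate * error * x for x in inputs]
    return [w + d for w, d in zip(weights, deltas)]


weights = [0.0, 0.0]   # illustrative starting weights
inputs = [1.0, 2.0]    # illustrative input vector
target = 1.0           # illustrative desired output

initial_error = neuron_error(weights, inputs, target)
for _ in range(50):
    weights = update_weights(weights, inputs, target)
final_error = neuron_error(weights, inputs, target)
```

With these values, each update halves the error, since the error is multiplied by (1 − η · Σ x_i²) = 0.5 at every iteration.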

As for the application of ANNs in information security, it is possible to obtain interesting results, such as in the classification of malicious and phishing sites, and in the classification of traffic that exploits denial-of-service vulnerabilities in information systems [29–30].

2.5 Self-organizing Map

The Self-organizing Map (SOM) proposed by Kohonen [31] is a network built around a one- or two-dimensional grid of neurons to capture the important characteristics contained in an input space (data) of interest. The SOM network is an ANN based on unsupervised learning capable of processing input from a multidimensional space, transforming it into a one-dimensional or two-dimensional array. The SOM algorithm is inspired by neurobiology and incorporates all the basic mechanisms for self-organization: competition, cooperation, and self-amplification [28] [31].

The structure of the SOM network is composed of neurons interconnected by a relationship called a neighborhood. It is this relationship that determines the topology of the map. For each input vector provided to the SOM network, there is a competition among all neurons for the right to represent it. The neuron that wins the competition is the one whose weight vector has the values closest to the input vector. This type of learning is called competitive learning [4].

In Fig. 2, an example of the training phase of the SOM network is presented, simulating 16 neurons simultaneously receiving the input vector X.

When each of the X input vectors is processed by the SOM, each output neuron receives a value and calculates its activation level, according to Eq. (3),

where X is the input vector, i is the index that indicates which neuron is receiving the input value, and $\hat{w}_{i}$ is the weight vector between the input value and the neuron. The Best Matching Unit (BMU) will be the neuron with the highest $\hat{u}_{i}$, that is, the one closest to the input vector. This will be the neuron that represents the pattern of the input vector data. The other M neurons compete to determine which one will receive a value closer to the BMU to also remain active.
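The selection of the BMU can be sketched as follows. Because Eq. (3) is not reproduced in this text, the activation level is assumed here to be the negative Euclidean distance between the input vector and the weight vector, so that the highest activation corresponds to the closest neuron, as described above. The function names and sample values are illustrative.

```python
import math


def activation(x, w):
    """Assumed activation: negative Euclidean distance (higher means closer)."""
    return -math.dist(x, w)


def best_matching_unit(x, weight_vectors):
    """Index of the neuron with the highest activation for the input x."""
    return max(range(len(weight_vectors)),
               key=lambda i: activation(x, weight_vectors[i]))


# Three neurons with illustrative two-dimensional weight vectors.
weight_vectors = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.2]]
x = [0.9, 0.8]
bmu = best_matching_unit(x, weight_vectors)
```

In this toy case the second neuron (index 1) wins, because its weight vector is the closest to the input.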

The SOM network algorithm can be synthesized in five steps [27–28], which are described in Table 6.

Table 6

Five SOM steps synthesized by Haykin
Step	Description
Initialization	Choice of random values for the weight vectors.
Choice of Input Pattern	Choosing an input pattern x to present to the neurons and determining their neighborhood.
BMU Definition	Choosing the BMU neuron based on the similarity between the neuron's activation level and the input value.
Weight Update	Modification of the values of the vectors of the weights of the neurons in the network.
Continuation	Repeat steps 2, 3, and 4 until no significant changes in the map are observed.
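The five steps in Table 6 can be sketched in plain Python as below. This is a simplified illustration, not the implementation used in the experiments: it assumes a rectangular grid, Euclidean distance for the BMU, a Gaussian neighborhood, linear decay of the learning rate and radius, and a fixed number of iterations as the stopping rule. The function name `train_som` and all parameter names are hypothetical.

```python
import math
import random


def train_som(data, rows, cols, iterations, learning_rate=0.5, sigma=1.0, seed=0):
    """Train a small rectangular SOM following the five steps of Table 6."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: random initial weight vectors, one per grid position.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        # Assumed schedule: learning rate and radius decay linearly.
        decay = 1.0 - t / iterations
        lr = learning_rate * decay
        radius = max(sigma * decay, 1e-3)
        # Step 2: pick an input pattern.
        x = rng.choice(data)
        # Step 3: the BMU is the neuron whose weights are closest to x.
        bmu = min(weights, key=lambda pos: math.dist(x, weights[pos]))
        # Step 4: pull the BMU and its grid neighbors toward the input.
        for pos, w in weights.items():
            grid_dist_sq = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
            influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
            for k in range(dim):
                w[k] += lr * influence * (x[k] - w[k])
        # Step 5: repeat (here, for a fixed number of iterations).
    return weights


# Illustrative data: two small groups of two-dimensional points.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, rows=3, cols=3, iterations=500)
```

After training, the BMU of a new vector can be found with the same `min(...)` expression used in step 3, and vectors that share a BMU or neighboring BMUs can be treated as members of the same group.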

To assess the quality of the map and analyze whether the chosen topology is the one that best represents the input vector data X, some quality measures can be used, such as the Quantization Error (QE) and the Topographic Error (TE) [31]. Table 7 presents the description of each of these measures.

Table 7

Accuracy measures for SOM
Measure	Description	Equation
Quantization Error (QE)	Shows the quality of the input vector data. The better the quality of the input vector, the better the arrangement of neurons on the map. The quantization error will be close to zero when all nodes are well distributed in the map.
Topographic Error (TE)	Measures topology preservation of input data. As data moves from a multidimensional space to a two-dimensional or one-dimensional space, some information is lost. One way to evaluate the representation of the initial input vector is to use the topographic error. When the topographic error is close to zero, it means that all nodes represent the initial input vector well.
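Since the Equation column of Table 7 is not reproduced here, the sketch below uses the commonly adopted definitions as an assumption: QE as the mean distance between each input vector and the weight vector of its BMU, and TE as the fraction of input vectors whose best and second-best units are not neighbors on the map. A rectangular grid with 8-neighborhood is also assumed; a hexagonal map would need a different adjacency test.

```python
import math


def two_best_units(x, weights):
    """Grid positions of the closest and second-closest neurons to x."""
    ranked = sorted(weights, key=lambda pos: math.dist(x, weights[pos]))
    return ranked[0], ranked[1]


def quantization_error(data, weights):
    """Mean distance between each input vector and its BMU's weight vector."""
    total = 0.0
    for x in data:
        bmu, _ = two_best_units(x, weights)
        total += math.dist(x, weights[bmu])
    return total / len(data)


def topographic_error(data, weights):
    """Share of inputs whose two best units are not neighbors on the grid."""
    errors = 0
    for x in data:
        first, second = two_best_units(x, weights)
        # Rectangular grid assumption: neighbors differ by at most 1 per axis.
        if max(abs(first[0] - second[0]), abs(first[1] - second[1])) > 1:
            errors += 1
    return errors / len(data)


# Toy 1x3 map keyed by (row, column); the values are illustrative only.
weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 0.0], (0, 2): [0.1, 0.0]}
data = [[0.0, 0.0], [1.0, 0.0]]
qe = quantization_error(data, weights)
te = topographic_error(data, weights)
```

In this toy map both inputs coincide with a weight vector, so QE is 0, while one of the two inputs has its best and second-best units two columns apart, so TE is 0.5.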

The literature review was performed using the following keywords: "Natural Language Processing", "Google Hacking", "GHDB", "Dorks", "Artificial Neural Networks" in the databases ACM Digital Library, Emerald Insight, IEEE Digital Library, and ScienceDirect. The Dorks base selected was the GHDB (https://www.exploit-db.com/google-hacking-database) because it has the largest number of documented and tested Dorks among all those available on the internet [17–18]. The GHDB has a total of 4,211 Dorks and 4 attributes: Date, which contains the date the Dork was published in the base; Dork, which contains the Dork and its access link; Category, which informs the category the Dork belongs to; and Author, which informs who submitted the Dork to the base. In Table 8, a sample of the GHDB base is presented.

Table 8

GHDB Sample
Date	Dork	Category	Author
12/08/2019	intitle:Administration - Installation - MantisBT	Footholds	Mr.XSecr3t
14/06/2018	"username.xlsx" ext:xlsx	Files Containing Usernames	ManhNho
22/08/2019	intitle:"index of" /content/admin/	Sensitive Directories	Reza Abasi
02/01/2019	"dispatch = debugger."	Error Messages	deadroot

3.1 Conducting Computational Experiments

The steps of performing the computational experiments shown in Fig. 3 were based on three approaches to perform Open Source Intelligence (OSINT), as shown in Table 9.

Table 9

Three Approaches to Running Open-Source Intelligence
Approach	Authors	Year
OSINT Approach to Support Cybersecurity Operations	[32]	2018
OSINT Approach to Inspecting Critical Infrastructure Systems	[33]	2016
OSINT Approach to Obtain Intelligence Information from Cyber Threats	[34]	2018

The authors reinforce in their work that when running OSINT through an approach, together with ML techniques, it becomes possible to extract new knowledge from the discovered information.

So, the computational experiments were conducted in seven steps: Selection of the GHDB Base, Removal of Hyperlinks and Deletion of Attributes, Removal of the Site Parameter from Dorks, Removal of Outliers and Stopwords, Enrichment with Natural Language Processing, Base Transformation and Application of the SOM. Figure 3 presents the flowchart with the seven steps of computational experiments.

a) Step A - Selection of the Google Hacking Database: In this step, the Dorks GHDB base was selected to conduct the computational experiments.

b) Step B – Removing Attributes and Hyperlinks: In this step, the hyperlinks embedded in Dorks were removed, along with nominal attributes from the GHDB database that were disregarded.

c) Step C – Removing the Site Parameter in Dorks: In this step, site-specific Dorks were turned into Dorks capable of running on any site. For this, the “Site” parameter present in Dorks was removed.

d) Step D – Removing Outliers and Stopwords: In this step, the removal of Outliers and Stopwords was conducted. Removed Stopwords were special characters present in Dorks. The removed Outliers were Composite Dorks and URLs.

e) Step E – Enrichment with Natural Language Processing: In this step, the base Dorks were selected and divided by characters applying tokenization by NLP. Then, the enrichment was conducted, transforming each Dork character into an attribute.

f) Step F – Base Transformation: In this step, the base Dorks were selected and converted to their respective numerical values in ASCII.

g) Step G – SOM Application: In this step, the SOM was applied to validate the GHDB enrichment and conversion and to generate clusters of similar Dorks. Its performance is evaluated by the Quantization Error (QE) and Topographic Error (TE) values. Good results in both errors indicate that the enriched and converted GHDB enables the application of ML techniques.

The results of the computational experiments obtained with the application of the seven steps shown in Fig. 3 are presented below.

a) Step A - Selection of the Google Hacking Database: At this step, the Google Hacking Database (GHDB) from Offensive Security was selected. Because it is an online base, it was necessary to copy the Dorks from the site and export them to a .csv file. The base has a total of 14 categories of Dorks.

For this experiment, the Dorks of the categories: “Advisories and Vulnerabilities” and “Files Containing Juicy Info” were selected as a sample. These categories were chosen because they have the largest number of Dorks, respectively with 1996 and 450 Dorks.

b) Step B – Removing Attributes and Hyperlinks: In Excel, the hyperlinks from the Dorks and the Author and Date attributes were removed from the base, as these attributes do not influence the Dorks. In this way, the base was left with 2 attributes: Dork and Category.

c) Step C – Removing the Site Parameter in Dorks: Site-specific Dorks were turned into Dorks capable of running on any site. For this purpose, the Site parameter was removed from the Dorks that had it. To locate these Dorks, Excel was used to search for the parameter “Site:”.

After finding the Dorks that contained the “Site” parameter, these Dorks were modified, removing this parameter. Among the Dorks that had the “Site:” parameter, specific Dorks were found for Proxy Sites, Google Drive, Github, Mediafire, Dropbox, Sourceforge, and eBay.
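The experiment performed this removal in Excel. As an illustration only, the same operation could be scripted in Python with a regular expression; the pattern, the function name `remove_site_parameter`, and the sample Dorks below are assumptions and were not part of the original procedure.

```python
import re

# Matches a "site:" operator and its value, e.g. site:github.com
# or site:"drive.google.com" (case-insensitive).
SITE_PARAMETER = re.compile(r'\bsite:\s*(?:"[^"]*"|\S+)', re.IGNORECASE)


def remove_site_parameter(dork):
    """Drop every site: operator and collapse the leftover whitespace."""
    without_site = SITE_PARAMETER.sub(" ", dork)
    return " ".join(without_site.split())


# Hypothetical Dorks used only to exercise the function.
generic = remove_site_parameter('site:github.com intext:"password" filetype:env')
generic_quoted = remove_site_parameter('intitle:"index of" Site:"drive.google.com"')
```

Dorks without a Site parameter pass through unchanged, so the function can be applied to the whole base.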

d) Step D – Removing Outliers and Stopwords: At this step, when analyzing the Dorks, it was noticed that a few had more than 100 characters in their composition. These Dorks had more than 100 characters for two main reasons: Composite Dorks and URLs. Thus, they were considered in this experiment as Outliers.

Composite Dorks are Dorks that have more than one Dork in their String. URLs are links to specific vulnerabilities on certain websites. Dorks that dealt with URLs were removed, as there would be no way to make them generic and thus automatically run them on other web pages. Composite Dorks were divided into smaller Dorks and then added to the base in their respective categories.

Then, the removal of Stopwords was performed to reduce noise at the base. This was necessary because in the GHDB database there are some Dorks with special characters that, when converted to their numerical value, have a value very different from the alphanumeric characters. To perform the removal of Stopwords, we defined 40 special characters as Stopwords to be removed. The removed Stopwords were as follows: ,':;"’!?”()`@~/|*[]^_.+\#%¨¬&©ºª}{£¢§-⌂
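Step D can be sketched in Python as follows. The character set reproduces the special characters listed above; the "=" character is added as an assumption, because the worked example in step F also drops it. The 100-character threshold follows the text, while the function names are hypothetical. Whether blank spaces were also removed is not stated, so they are kept here.

```python
# Special characters listed in step D; "=" is added here as an assumption
# because the worked example in step F also drops it.
STOPWORD_CHARS = set(",':;\"’!?”()`@~/|*[]^_.+\\#%¨¬&©ºª}{£¢§-⌂" + "=")

# Dorks longer than this were treated as outliers in the experiment.
MAX_DORK_LENGTH = 100


def is_outlier(dork):
    """Flag Dorks with more than 100 characters (composite Dorks and URLs)."""
    return len(dork) > MAX_DORK_LENGTH


def remove_stopwords(dork):
    """Delete the special characters; all other characters are kept."""
    return "".join(ch for ch in dork if ch not in STOPWORD_CHARS)


cleaned = remove_stopwords("inurl:/phpmyadmin/index.php?db=")
```

Flagged outliers still need the manual treatment described above: URL Dorks are discarded and composite Dorks are split into smaller ones.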

e) Step E – Enrichment with Natural Language Processing: In this step, the base Dorks were selected and divided into characters by applying NLP tokenization. Each character then became an attribute in the base. This was necessary because the base had only two attributes so far: Dork and Category. The low number of attributes makes it impossible to apply ML techniques to this base.

To enrich this base, that is, to add new attributes, an algorithm was developed in Python to find the Dork with the greatest number of characters and create the same number of attributes in the Dorks base. Each Dork can then be divided into characters, with one character per attribute. This action not only enriches the base but also prevents, in the next step of the experiment (step F), the numeric values obtained from the ASCII conversion from becoming too long, which would make the application of ML techniques impossible.

For example, a 10-character Dork, when converted as a whole to its numeric value, becomes a value of up to 30 digits. This is because each character converted to ASCII has a numeric value of up to 3 digits. On the other hand, if each base attribute holds only a single character, each attribute receives a numeric value of at most 3 digits, enabling the application of intelligent techniques to the base.

Thus, 94 attributes were created in the base, named Carac01, Carac02, Carac03 to Carac94. The database now has a total of 95 attributes: the 94 “Carac” attributes plus the Category attribute, with numerical values defined in step C. The Dork division was performed through the “nltk.word_tokenize()” function.
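The enrichment can be illustrated with the sketch below, which finds the longest Dork and creates one attribute per character position. It splits each Dork with Python's built-in `list()` rather than with NLTK, and the function name `enrich`, the `None` padding, and the sample Dorks are assumptions made for illustration; it is not the algorithm developed by the authors.

```python
def enrich(dorks, prefix="Carac"):
    """Turn each Dork into one attribute per character, padded to the longest Dork."""
    # The longest Dork defines how many attributes are created.
    width = max(len(dork) for dork in dorks)
    names = [f"{prefix}{i:02d}" for i in range(1, width + 1)]
    rows = []
    for dork in dorks:
        # Unused positions are padded with None at this stage (assumption).
        chars = list(dork) + [None] * (width - len(dork))
        rows.append(dict(zip(names, chars)))
    return names, rows


# Two hypothetical, already cleaned Dorks.
names, rows = enrich(["inurlphp", "extsql"])
```

With these two sample Dorks, eight attributes (Carac01 to Carac08) are created, and the shorter Dork leaves its last two attributes empty.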

f) Step F – Base Transformation: After applying NLP in step E, the Dorks characters were converted to their numerical values. For this, we selected the Dorks characters and converted them to their respective numerical values in ASCII.

This conversion follows the study by Guo et al. (2018), which converts characters to their numeric ASCII values to detect memory overflow vulnerabilities. To conduct it, the “ord()” function of the Python language was used, the same function used in Guo's study. For example, consider the Dork:

inurl:/phpmyadmin/index.php?db=

In step D, this Dork was processed along with the other Dorks in the base, and the special characters were removed. So, this Dork became:

inurlphpmyadminindexphpdb.

Then, in step E, the Dorks were divided into characters. This Dork now has 25 characters, and each character was assigned to an attribute. The first character of this Dork, “i”, was assigned to the attribute Carac01; the second character, “n”, was assigned to the attribute Carac02, and so on until the end of the Dork. The other attributes received a value of 0 so as not to leave null values in the base.

In step F, this Dork had its characters converted to their numeric values in ASCII. Thus, the characters of this Dork now have the following values:

105 110 117 114 108 112 104 112 109 121 97 100 109 105 110 105 110 100 101 120 112 104 112 100 98
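The conversion of this example can be reproduced with Python's built-in `ord()` function, which the text names. The helper `dork_to_ascii` and its `width` parameter are hypothetical; the width of 94 and the zero padding follow the description of the base given above.

```python
def dork_to_ascii(dork, width):
    """Convert each character with ord() and pad with 0 up to `width` attributes."""
    codes = [ord(ch) for ch in dork]
    return codes + [0] * (width - len(codes))


vector = dork_to_ascii("inurlphpmyadminindexphpdb", width=94)
```

The first 25 positions of `vector` hold the values listed above, and the remaining 69 positions are filled with 0.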

g) Step G – SOM Application: After enriching and transforming the Dorks base, SOM was applied to validate the enrichment and conversion performed on the Dorks base, the possibility of applying ML techniques, and finding vulnerabilities in the generated clusters. For this, we sought to extract knowledge from the Dorks base with the application of SOM. To perform the SOM, we defined the map dimension with 225 neurons, that is, a 15x15 map, and hexagonal topological neighborhood. In addition, the parameters used in the training phase were number of epochs (iterations) equal to 3000 and learning rate equal to 0.5 [3].

For this experiment, all Dorks from the categories “Advisories and Vulnerabilities” and “Files Containing Juicy Info” were selected as a sample. These two categories have the highest number of Dorks. The “Advisories and Vulnerabilities” category has a total of 1,996 Dorks, which search for web pages with unprotected files. The application of SOM generated a map with 3 groups. This map is shown in Figure 4.

The application of the SOM network generated three groups in the Advisories and Vulnerabilities base. Table 10 shows the characteristics of each one of them.

Table 10 Map Characteristics in the Advisories and Vulnerabilities Category

Cluster	Color	Dorks	Total (%)	Vulnerabilities
C1	Blue	1092	54.71%	Online Devices
C2	Red	622	31.16%	URL requests
C3	Yellow	282	14.13%	Multiple URL requests

Table 10 shows that these Dorks address vulnerabilities that allow advertisements and other messages on web pages. The Dorks seek online devices and URL requests to look for sensitive information. The “Files Containing Juicy Info” category has a total of 450 Dorks, which search for unprotected files with information about other systems on web pages. The application of SOM generated a map with 4 clusters. This map is shown in Figure 5.

The application of SOM generated four groups in the “Files Containing Juicy Info” base. Table 11 shows the characteristics of each one of them.

Table 11 Map Characteristics in the Files Containing Juicy Info Category

Cluster	Color	Dorks	Total (%)	Vulnerabilities
C1	Green	6	1.33%	SQL Dump Queries
C2	Red	135	30.00%	Email Servers
C3	Blue	228	50.67%	Text Files and Github Files
C4	Yellow	81	18.00%	Proj Files and Netscape
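As a consistency check, the percentage columns of Tables 10 and 11 can be recomputed from the Dork counts per cluster. The short sketch below does this; the function name and the dictionary layout are illustrative choices, and only the counts come from the tables.

```python
def cluster_shares(counts):
    """Return each cluster's share of the total, in percent, rounded to 2 decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Dork counts per cluster, taken from Tables 10 and 11.
advisories = {"C1": 1092, "C2": 622, "C3": 282}
juicy_info = {"C1": 6, "C2": 135, "C3": 228, "C4": 81}

print(sum(advisories.values()), cluster_shares(advisories))
print(sum(juicy_info.values()), cluster_shares(juicy_info))
```

The counts add up to the 1,996 and 450 Dorks reported for the two categories, and the recomputed shares agree with the tabulated percentages.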

Table 11 shows that these Dorks address vulnerabilities that allow the exploitation of unprotected files with information about other systems on web pages. These files have various extensions and span technologies such as SQL and Netscape. The quality of the maps generated by the SOM was then evaluated through the Quantization Error (QE) and the Topographic Error (TE). The values are shown in Table 12.

Table 12 Results of the metrics of the maps generated by the SOM network

Base	QE	TE
Advisories and Vulnerabilities	0.0990822	0.008849558
Files Containing Juicy Info	0.0202605	0.01703407

Analyzing the results in Table 12, both errors had values close to 0. A low QE indicates that the nodes represent the input vectors well, and a low TE indicates that the topology of the input data was preserved on the map. The errors obtained in the application of SOM can therefore be considered good. It is thus understood that the enrichment of the GHDB with NLP, together with the conversion of Dorks characters to numeric ASCII values, made it possible to apply an ML technique to generate groupings of similar Dorks and to identify vulnerabilities.
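For reference, the two metrics can be computed as in the sketch below, which follows their usual definitions: QE is the mean distance between each input vector and its best matching unit, and TE is the share of inputs whose two closest nodes are not neighbors on the map. The function names, the toy map, and the rule that neighbors lie one lattice unit apart are assumptions for illustration, not the code behind Table 12.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def quantization_error(weights, data):
    """Mean distance between each input vector and its best matching unit."""
    return sum(min(euclidean(w, x) for w in weights) for x in data) / len(data)


def topographic_error(weights, positions, data, tol=1e-9):
    """Share of inputs whose two closest nodes are not neighbors on the map."""
    errors = 0
    for x in data:
        ranked = sorted(range(len(weights)),
                        key=lambda k: euclidean(weights[k], x))
        first, second = ranked[0], ranked[1]
        # Neighboring nodes are assumed to lie one lattice unit apart.
        if euclidean(positions[first], positions[second]) > 1 + tol:
            errors += 1
    return errors / len(data)


# Toy map: three nodes in a row with one-dimensional weights.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
data = [[0.1], [1.9]]
ordered = [[0.0], [1.0], [2.0]]  # weights follow the order of the map
folded = [[0.0], [2.0], [1.0]]   # weights out of order: the map is folded

print(quantization_error(ordered, data), topographic_error(ordered, positions, data))
print(quantization_error(folded, data), topographic_error(folded, positions, data))
```

In the toy example both maps have the same QE, but only the folded map has a nonzero TE, which illustrates that the two metrics capture different properties of the map.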

This paper applied NLP to enrich the attributes of the GHDB and converted its textual values into numeric ASCII values to enable the application of ML techniques. The computational experiments comprised seven steps, which culminated in the validation by SOM of the GHDB enrichment and conversion, in addition to the generation of clusters of similar Dorks and the identification of vulnerabilities.

The results obtained with the application of the SOM were considered good, based on the values of the metrics that evaluated the network. Thus, it is considered that the objective of this paper was achieved.

With the base enriched and converted, it becomes possible to use other ML techniques to automate information security tests, such as in the construction of OSINT approaches or the creation of rules for defense systems such as Firewalls, IDS, and IPS, making them capable of detecting GH practices.

Among the limitations observed in this paper, two stand out: the definition of stopwords, because no predefined set of special characters was found, and the lack of studies in the literature against which to compare the results, since the phases of the computational experiments were inspired by three different approaches.

The study conducted here does not intend to exhaust the subject. On the contrary, it sought to contribute to the Information Security area regarding the application of ML techniques in the identification of vulnerabilities by enriching and converting the GHDB. It is expected that the phases presented and applied in the computational experiments can stimulate further research. This scenario, therefore, offers ample room for future work.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Research Data Policy and Data Availability Statements

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Acknowledgements

We would like to thank Universidade Nove de Julho for supporting this research.

Dobrovoljc, Andrej; Trček, Denis; Likar, Borut. Predicting Exploitations of Information Systems Vulnerabilities Through Attackers’ Characteristics. IEEE Access, P. 26063-26075, 2017.
Toffalini, Flavio; Abbà, Maurizio; Carra, Damiano; Balzarotti, Davide. Google Dorks: Analysis, Creation, and New Defenses. Lecture Notes in Computer Science, V. 9721, P. 255-275, 2016.
Kohonen, Teuvo. Exploration of Very Large Databases by Self-organizing Maps. Proceedings of International Conference on Neural Networks (ICNN'97). IEEE, V. 1, P. 1-6, 1997.
López, Alberto Urueña; Mateo, Fernando; Navío-Marco, Julio; Martínez-Martínez, José María; Gomes-Sanchís, Juan; Vila-Francés, Joan; Serrano-López, Antonio José. Analysis of Computer User Behavior, Security Incidents and Fraud Using Self-Organizing Maps. Computers & Security, V. 83, P. 38-51, 2019.
Evangelista, João Rafael Gonçalves; Sassi, Renato José; Romero, Márcio; Napolitano, Domingos. Systematic Literature Review to Investigate the Application of Open-Source Intelligence (OSINT) with Artificial Intelligence. Journal of Applied Security Research, P. 1-25, 2020.
Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P. The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, V. 39, P. 27-34, 1996.
Guo, Hui; Huang, Shu-Guang; Pan, Zu-Lie; Hu, Jian-Ping; Hu, Ming-Lei. Research on Key Data Structure Localization Technology of Buffer Overflow Vulnerability. In: Proceedings of the 2018 International Conference on Information Science and System. P. 81-85, 2018.
Mansfield-Devine, Steve. Taking Responsibility for Security. Computer Fraud & Security, V. 2015, N. 12, P. 15-18, 2015.
Chao, Chih-Yang; Chang, Tsai-Chu; Wu, Hui-Chun; Lin, Yong-Shun; Chen, Po-Chen. The Interrelationship Between Intelligent Agents’ Characteristics and Users’ Intention in a Search Engine by Making Beliefs and Perceived Risks Mediators. Computers in Human Behavior, V. 64, P. 117-125, 2016.
Kwak, Kyu Tae; Lee, Seung Yeop; Ham, Minjeong; Lee, Sang Woo. The Effects of Internet Proliferation on Search Engine and Over-The-Top Service Markets. Telecommunications Policy, P. 102146, 2021.
Roy, Ahana; Meija, Louis; Helling, Paul; Olmsted, Aspen. Automation of Cyberreconnaissance: A Java-Based Open-Source Tool for Information Gathering. In: ICITST - International Conference for Internet Technology and Secured Transactions. P. 424-426, 2017.
Mazurczyk, Wojciech; Caviglione, Luca. Cyber Reconnaissance Techniques. Communications of the ACM, V. 64, N. 3, P. 86-95, 2021.
Kalech, Meir. Cyber-Attack Detection in SCADA Systems Using Temporal Pattern Recognition Techniques. Computers & Security, V. 84, P. 225-238, 2019.
Rahman, Md Abdur; Amjad, Mahfida; Ahmed, Byezid; Siddik, Md. Saeed. Analyzing Web Application Vulnerabilities: An Empirical Study on E-Commerce Sector in Bangladesh. Proceedings of the International Conference on Computing Advancements. P. 1-6, 2020.
Pan, Daoxin; Bai, Wei; Zhang, Siyu; Zou, Futai. Detecting Malicious Queries from Search Engine Traffic. In: 2012 8th International Conference on Wireless Communications, Networking and Mobile Computing. IEEE, P. 1-4, 2012.
Quinkert, Florian; Leonhardt, Eduard; Holz, Thorsten. Dorkpot: A Honeypot-based Analysis of Google Dorks. In: Proceedings of the Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb ’19), San Diego, CA, 2019.
Zhang, Jialong; Notani, Jayant; Gu, Guofei. Characterizing Google Hacking: A First Large-Scale Quantitative Study. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering. Springer, V. 152, P. 602-622, 2015.
Mider, Daniel; Garlicki, Jan; Mincewicz, Wojciech. The Internet Data Collection with the Google Hacking Tool – White, Grey or Black Open-Source Intelligence? Przegląd Bezpieczeństwa Wewnętrznego, V. 11, N. 20, P. 280-300, 2019.
Sun, Shiliang; Luo, Chen; Chen, Junyu. A Review of Natural Language Processing Techniques for Opinion Mining Systems. Information Fusion, V. 36, P. 10-25, 2017.
Noubours, Sandra; Pritzkau, Albert; Schade, Ulrich. NLP as an Essential Ingredient of Effective OSINT Frameworks. In: Military Communications and Information Systems Conference (MCC), IEEE. P. 1-7, 2014.
Vijayakumar, S.; Sheshadri, K. N. Applications of Artificial Intelligence in Academic Libraries. International Journal of Computer Sciences and Engineering, V. 7, P. 136-140, 2019.
Zeroual, Imad; Lakhouaja, Abdelhak. Data Science in Light of Natural Language Processing: An Overview. Procedia Computer Science, V. 127, P. 82-91, 2018.
You, Wei; Zong, Peiyuan; Chen, Kai; Wang, Xiaofeng; Liao, Xiaojing; Bian, Pan. SemFuzz: Semantics-Based Automatic Generation of Proof-of-Concept Exploits. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. P. 2139-2154, 2017.
Naveen, Gouda; Naidu, M. Ashish; Rao, B. Thirumala; Radha, K. A Comparative Study on Artificial Intelligence and Expert Systems. International Research Journal of Engineering and Technology (IRJET). P. 1980-1986, 2019.
Mitchell, Tom. Machine Learning. McGraw-Hill, 1997.
Abiodun, Oludare Isaac; Jantan, Aman; Omolara, Abiodun Esther; Dada, Kemi Victoria; Mohamed, Nachaat Abdelatif; Arshad, Humaira. State-of-the-Art in Artificial Neural Network Applications: A Survey. Heliyon, V. 4, N. 11, P. e00938, 2018.
Haykin, Simon. Neural Networks: A Comprehensive Foundation. New York: Wiley & Sons, 1994.
Haykin, Simon. Redes Neurais - Princípios e Práticas. 2nd Edition, Bookman, Porto Alegre, 2001.
Ferreira, Ricardo Pinto; Martiniano, Andréa; Napolitano, Domingos; Romero, Márcio; Gatto, Dacyr Dante De Oliveira; Farias, Edquel Bueno Prado; Sassi, Renato José. Artificial Neural Network for Websites Classification with Phishing Characteristics. Social Networking, V. 7, P. 97-109, 2018.
Cui, Jie; Wang, Mingjun; Luo, Yonglong; Zhong, Hong. DDoS Detection and Defense Mechanism Based on Cognitive-Inspired Computing in SDN. Future Generation Computer Systems, V. 97, P. 275-283, 2019.
Kohonen, Teuvo. Self-organized Formation of Topologically Correct Feature Maps. Biological Cybernetics, V. 43, P. 59-69, 1982.
Rico, Ricardo Andrés Pinto; Medina, Martin José Hernández; Hernández, Cristian Camilo Pinzón; López, Daniel Orlando Díaz; Ruíz, Juan Carlos Camilo García. Inteligencia de Fuentes Abierta (OSINT) para Operaciones de Ciberseguridad. “Aplicación de OSINT en un Contexto Colombiano y Análisis de Sentimientos”. Revista Vínculos, V. 15, N. 2, 2018.
Lee, Seokcheol; Shon, Taeshik. Open-Source Intelligence Base Cyber Threat Inspection Framework for Critical Infrastructures. In: Future Technologies Conference (FTC). P. 1030-1033, 2016.
Li, Ke; Wen, Hui; Li, Hong; Zhu, Hongsong; Sun, Limin. Security OSIF: Toward Automatic Discovery and Analysis of Event Based Cyber Threat Intelligence. IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI). IEEE, P. 741-747, 2018.

No competing interests reported.
