Geographic data-mining and extraction of association rules using the Apriori algorithm (Case study: Capital of Iranian provinces)

Mahmoud Mahlouji-Bidgholi

Abstract  Extracting association rules between urban features provides latent and valuable information for urban planners about the relationships between urban characteristics and their similarities. For this purpose, the well-known Apriori algorithm is used in this paper. We present the Fariori algorithm, which makes efficient changes to Apriori so that items removable during execution are delayed and the main, frequent features are reached in the early stages. Although the spatial and temporal complexity of both algorithms is exponential in the number of features, in practice, by implementing the Fariori algorithm in MATLAB, we obtained more rules than the existing software (R, Weka, Market Basket Analysis, and Yarpiz). In the proposed algorithm, the degree of similarity can be tuned by adjusting the support and confidence parameters to identify a coherent set of similar cities. The database used comprises the 31 provincial capital cities of Iran. Discovering the association rules leads to finding similar cities and can be an efficient aid in urban planning.


Introduction
Recognizing the similarity of cities is a useful method for urban planning that allows positive decisions to be repeated for similar cities. Association rules make it possible to identify cities by discovering the relationships between urban features and so provide a better perspective of them. One of the most popular methods for extracting association rules is Apriori. In this paper, we propose useful modifications to the Apriori method that increase its speed and accuracy.
We explain association rules with an example. Suppose city A has features 1 and 2, city B has features 2 and 3, city C has features 1 and 3, and city D has only feature 1. The rule 1 ⇒ 3 states that a city with feature 1 will probably have feature 3 as well. Among the cities, only C has both attributes, but since cities A, C, and D have feature 1, they may also have feature 3. This allows urban planners to base decisions not only on an existing attribute (feature 1) but also on the possible existence of feature 3 in cities A, C, and D. How strongly this holds depends on a probability called confidence (conf). In this example, 1 of the 3 cities with feature 1 follows the rule, so the confidence of the rule is 33.3%. This allows urban decision-makers to consider more comprehensive issues and to increase the likelihood of success in urban projects by considering the future needs of cities.
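The arithmetic above can be checked with a few lines of code. The sketch below is illustrative (not the paper's implementation); it encodes the four hypothetical cities as feature sets and computes the confidence of the rule 1 ⇒ 3.

```python
from fractions import Fraction

# The four hypothetical cities from the example, each as a set of features.
cities = {
    "A": {1, 2},
    "B": {2, 3},
    "C": {1, 3},
    "D": {1},
}

def confidence(antecedent, consequent, transactions):
    """conf(X -> Y) = (#transactions with X and Y) / (#transactions with X)."""
    with_x = [t for t in transactions.values() if antecedent <= t]
    with_both = [t for t in with_x if consequent <= t]
    return Fraction(len(with_both), len(with_x))

print(confidence({1}, {3}, cities))  # 1/3 -- only C of the three cities with feature 1
```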
Using the proposed Fariori algorithm on data collected from Iranian cities, we have extracted association rules and also explain how to interpret them in geographical applications by finding hidden patterns. The analysis in this paper is based on the 11 main features presented in Tables 1 and 2. The geographical coordinates of the cities are shown in Appendix A, Table 13.

Related Works
In data mining, association rule learning is a popular and well-researched method for discovering interesting relationships between variables and identifying hidden patterns among items in large databases. The concept of association rules was first presented by Agrawal et al [3], who introduced them to discover patterns among products in a large-scale transaction database. Today, association rules are used very successfully in many applications, such as market basket analysis [4], computer networks [5], recommendation systems [6], and healthcare [7]. In the field of geographic information systems (GIS), association rules have also been explored to improve transportation and accessibility. Appice et al [8] presented association rules that could contribute new knowledge at the urban level and help direct resources to improve facilities, especially in areas with poor access to transportation. Sengstock et al [9] used association rules to improve the extraction of concepts from volunteered geographic information (VGI). Kashian et al [10] used them to analyze and discover meaningful patterns among points registered in a location database such as OpenStreetMap (OSM).
Some well-known algorithms for creating association rules are Apriori, Eclat, and FP-Growth [11]. Apriori is a basic algorithm with a breadth-first search (BFS) strategy that has been proposed with various features and variants. One modification of the Apriori algorithm reduces the number of passes over the database [12]. In another method, sampling is used instead of examining the entire database [13]. The use of parallel algorithms to discover association rules is proposed and investigated in [14].
Versichele et al [15] proposed spatial association rules for analyzing tourist attractions, arguing that attention to different tourist attractions can be strengthened by using association rules: distributing tourists across diverse areas is better than concentrating them in a limited one. Goswami and Sah [16] provided a template for exploring frequent patterns of spatial objects; this pattern is appropriate for places where objects are located close to each other. Yum and Kim provided a filtering technique combining customer behaviour patterns and spatial patterns on commercial sites using association rules [17].
Osadchiy et al [18] used association rules to create a dietary intake recommendation system that is resistant to the cold-start problem. Their system builds a model for identifying the similarity of preferences from transactions. Cheng and Cheng [19] extracted the similarity of plant features using the Association Rule-Based Similarity (ARBS) method. Borozanov et al [20] developed an architectural model to search for similar measures and components in enterprise architecture and used the association rules to find interesting elements.
Zhu [21] used the Apriori algorithm to discover rules in a physical-education student database and introduced the lift measure as an interestingness measurement. By removing some rules and extracting the more interesting ones, this method slightly reduces the time and space cost, but its degree of complexity is the same as that of classic Apriori.
The EAFIM (Efficient Apriori-based Frequent Item-set Mining) method, introduced by Raj et al [22], simultaneously generates candidates and counts their support values. It updates the input dataset by eliminating useless items and transactions, thereby reducing the size of the input dataset in each iteration.
The FTARM (Fast Top-K Association Rule Miner) method [23] finds the set of top-k association rules using the RGPP (Rule Generation Property Pruning) technique, which reduces the search space by analyzing the internal relationships between items. FTARM seeks the k most important rules, whereas our method tries to find all possible rules, demonstrating its computational power, since existing algorithms sometimes cannot find all rules in practice.

Materials and methods
Let an item-set I = {i1, i2, …, in} be a set of n binary features, and let D = {t1, t2, …, tm} be a database of transactions, where each transaction T in D is identified by a unique transaction identifier (TID) and is a subset of the items in I. A rule shows how frequently an item-set occurs in the transactions. A rule is defined as X → Y, where X, Y ⊆ I and X ∩ Y = ∅. X and Y are called the antecedent and consequent item-sets, respectively. Usually, a database contains hundreds or thousands of transactions.
To find the rules that exist between the cities' features and to analyze geographical data, the most crucial steps are: 1) creating an accurate database; 2) using appropriate software for data analysis. In this study, we create a precise and small sample urban database and provide a suitable method to recognize its patterns with the proposed algorithm.
In Table 3, the 11 main features of the cities are divided into 57 sub-features according to the amount and type of each feature. In each row of Table 4, all the sub-features of a city are identified. In Table 5, the presence or absence of each sub-feature is recorded for each city: 1 indicates that a city has the sub-feature, and 0 indicates that it does not. In this table, five features (the rounded columns) are not present in any city, so in practice 52 features are checked by the algorithm. Köppen's climate classification is well known, and detailed descriptions are given in Appendix A, Table 14.
Cities are considered as transactions and urban features as items; the unique names of the cities serve as transaction identifiers. Each column of Tables 1 and 2 can be numeric, binary, character, binomial, and so on. These tables are a pre-processing classification, and the most important principle in creating an urban database is observing the principles of classification and determining the precise type of each variable.
If a rule of the form {4} → {54} is obtained after processing the data, it means that a city with an altitude of 1501 to 2000 meters probably has an average annual precipitation of 301 to 600 mm. In the database, the cities with feature 4 are Arak, Hamadan, Isfahan, Kerman, Yasuj, and Zanjan. Of these six cities, only Arak, Hamadan, and Yasuj have an average annual precipitation of 301 to 600 mm, so the confidence of this rule is 50%: with 50% probability, a city with feature 4 will also have feature 54. Each rule therefore has a specific geographical interpretation.

Metrics for accepting association rules
To choose attractive rules from all possible rules, various constraints are applied using measures of significance and interestingness. The most popular constraints are minimum thresholds on support and confidence. The support of item-set X, denoted supp(X), is the ratio of the number of transactions that include X to the total number of transactions. In Table 5, the feature set {3, 6} has a support of 0.387 because it occurs in 38.7% of the transactions (12 out of 31).
Formally, supp(X) = fre(X) / N, where fre(X) denotes the frequency of the item-set (the number of transactions containing it) and N is the total number of transactions.
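As a quick sketch of this formula (the transaction list here is a hypothetical stand-in, not Table 5), 12 of 31 transactions containing {3, 6} reproduce the 0.387 figure quoted above.

```python
def support(itemset, transactions):
    fre = sum(1 for t in transactions if itemset <= t)  # fre(X)
    return fre / len(transactions)                      # divided by N

# Hypothetical database shaped so that 12 of its 31 transactions contain {3, 6}.
transactions = [{3, 6}] * 12 + [{3}] * 10 + [{6}] * 9
print(round(support({3, 6}, transactions), 3))  # 0.387
```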
In a set of transactions, a rule X → Y has support s if X ∪ Y appears in s% of the transactions, and confidence c if c% of the transactions that include X also include Y. Rules whose support and confidence exceed the minimum support threshold (min-supp) and the minimum confidence threshold (min-conf), respectively, are called strong. The original purpose of association rule mining algorithms is to discover patterns that represent strong rules.
In most algorithms, including Apriori, finding association rules is a two-step process. In the first step, a set of important item-sets (those whose support exceeds min-supp) is identified, and in the second step, this set is used to generate strong association rules. Note that the overall efficiency of association rule mining algorithms depends on the first stage of the process [24]; extracting the association rules is easy once the important item-sets have been found. In addition to the support and confidence metrics, other criteria have been proposed to measure the attractiveness of rules.

Apriori Algorithm and new Suggestions
Apriori is an unsupervised machine learning algorithm that extracts association rules from a transaction database. The minimum support is given to the algorithm as an input parameter and determines how frequently an item-set must appear for it to take part in an association rule.
The Apriori algorithm uses prior knowledge of the important features of item-sets [11]. A set of k-item candidates is called C_k, and the set of important k-item-sets is called L_k. The algorithm works incrementally: C_k is used to obtain C_{k+1}. Initially, the important single-member sets are chosen; they form L_1, which includes all items that appear in at least a min-supp percentage of transactions. L_1 is used to obtain L_2, and similarly, L_{k-1} is used to obtain L_k. An important property that improves the performance of extracting the L_k sets is that all non-empty subsets of an important item-set are also important. The validity of this proposition is easy to see: an item-set that does not possess the minimum support is unimportant, and adding extra items to such an item-set cannot make it important.
The production of the L_k sets is summarized as follows. First, to obtain L_k, a set of k-item candidates, called C_k, is produced by joining members of L_{k-1}; two members are joined only if they share exactly k-2 items. The next stage is pruning. C_k is a superset of L_k: all members of L_k are in it, but not all members of C_k are necessarily important. A single pass over the database can identify the important members, but a better approach uses the Apriori property: if any (k-1)-member subset of a candidate in C_k is not in L_{k-1}, that candidate cannot be an important set. The database then only needs to be scanned for the C_k members that pass this check.
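The join and prune stages described above can be sketched as follows. This is an illustrative Python rendering, not the paper's code; the L_2 set matches the running example in the next paragraph, where {C, F} and {D, F} are absent.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Build C_k from L_{k-1} (frozensets of size k-1): join, then prune."""
    joined = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k:          # a and b share exactly k-2 items
                joined.add(frozenset(union))
    # Apriori prune: every (k-1)-subset of a candidate must be in L_{k-1}.
    return {c for c in joined
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))}

L2 = {frozenset(p) for p in [("A","B"), ("A","C"), ("A","D"), ("A","F"),
                             ("B","C"), ("B","D"), ("B","F"), ("C","D")]}
C3 = generate_candidates(L2, 3)
print(frozenset({"B","C","F"}) in C3)  # False: {C, F} is not in L_2
print(frozenset({"B","C","D"}) in C3)  # True: all of its pairs are in L_2
```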
Suppose a set, as in Table 6, contains five transactions. First, every item (A, B, C, etc.) is a member of the candidate set C_1, and the algorithm finds all items in the database together with their frequencies (Table 7). If the min-supp threshold is 60% (min-supp = 3), the items selected are those that appear in at least 3 transactions. Item E is removed because it appears in only 2 transactions, which is less than min-supp. The set of important single items L_1, which includes A, B, C, D, and F, is thus easily obtained from C_1. To obtain L_2, the algorithm first generates C_2, the set of candidate item pairs. Since the pairs {C, F} and {D, F} are present in only two transactions, they are removed, and the remaining pairs make up the L_2 set.
When generating C_3 and C_4, the candidate item-sets must not include {C, F} or {D, F}. To generate C_3, the Apriori property, which states that all subsets of an important set must be important, shows that, for example, the two item-sets {B, C, F} and {B, D, F} cannot be important: for {B, C, F}, the subset {C, F} is not a member of L_2, and for {B, D, F}, the subset {D, F} is not a member of L_2. Consequently, without further analysis, these two sets cannot be important, and their supports need not be calculated. As mentioned before, all 2-item subsets of the important set {B, C, D} are important, and after calculating the support, this set is placed in L_3. Table 5 is ready to be used in the algorithm, and depending on the chosen software, appropriate storage files have to be created. The sum of the 1s in each column indicates the support of that feature. For instance, supp(1) = 6/31 means that feature 1 exists in 6 out of the 31 cities. In the next step, the extraction of rules follows, and finally, strong rules are extracted under various minimum confidence values.

The First Suggestion
Since Apriori is a breadth-first search algorithm, finding the support of an item requires scanning the entire database from left to right and top to bottom. To reduce the computations, we suggest creating another table with the same number of columns as Table 6, with each item placed in its own column: for example, column 1 for item A, column 2 for item B, and so on (Table 8). The total number of entries in each column is the support of that column's item. In the example above, Table 8, like Table 6, has six columns and five rows. Half of the operations place items in the table, and the other half count the items in all columns. The calculated value for the above example is 60 (5 × 6 × 2).
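A minimal sketch of this columnar layout follows. The transactions are a hypothetical stand-in for Table 6 (an assumption, constructed to match the counts reported in the text: 22 items in total, supp(E) = 2); one pass fills the table and a second pass sums each column.

```python
# Hypothetical transactions consistent with the example's reported counts.
transactions = [
    {"A", "B", "C", "D"},
    {"B", "C", "D", "E"},
    {"A", "B", "C", "D", "F"},
    {"A", "B", "C", "F"},
    {"A", "B", "D", "E", "F"},
]
items = sorted({i for t in transactions for i in t})  # one column per item

# First half of the operations: place each item in its own column.
table = [[1 if item in t else 0 for item in items] for t in transactions]
# Second half: sum every column to read off the supports.
supports = {item: sum(row[j] for row in table) for j, item in enumerate(items)}
print(supports["E"])  # 2 -- below min-supp = 3, so E is removed
```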
In the Apriori method, the number of operations required to obtain the supports is given by Eq. 3:

Number of operations in Apriori = (total number of items in all transactions) × (number of distinct items)   (3)
In the five transactions of the example above, there are 22 items in total, of 6 different types (A, B, C, D, E, and F). For each item type the scan visits all 22 items, which requires a total of 132 (22 × 6) operations. Now suppose the same five transactions involve 100 item types and each transaction contains 100 items. In the proposed method, the number of operations is 1000 (5 × 100 × 2), while in the Apriori method, the operations required to calculate the supports number 50000 (500 × 100), which is 50 times more than the proposed method requires.

The Second Suggestion
In the Apriori algorithm, a set of item pairs is created first (step C_2); then, depending on the support values, some item-sets are pruned, and in each subsequent step one item is added to the item-sets (3-item-sets, 4-item-sets, etc.). The proposed method instead refuses to create the full 2-item-set stage C_2 and operates more intelligently. In our example, the items are ordered by support from largest to smallest, left to right (Table 9). The last column (E) is deleted because its support is below min-supp. Then, in Table 10, the items are entered one by one from left to right, and duplicate combinations of items are not included in the table.
In the first step, the two items B and A are considered, and the support of the compound item-set {B, A} is obtained through Table 8 (supp(B, A) = 4). In the second step, item C is added, and all combinations involving it are derived without repeating earlier work, with support values calculated for each item-set: two pairs ({A, C}, {B, C}) and one triple ({B, A, C}). The pair {B, A} was already examined in the first step and is not repeated. The support of each of these item-sets indicates that they are frequent. If a set has support below min-supp, it is removed, as in the Apriori algorithm, and its combinations do not appear in later steps.
By adding one item in each step, all item combinations are eventually considered, unlike in the Apriori algorithm, where each step handles only item-sets of one size: when the item-set has n items, all combinations of 2, 3, …, n items have been obtained by step n-1. The frequencies of the single item-sets were already calculated in Table 9.
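The incremental scheme above can be sketched as follows, using the same hypothetical stand-in transactions for Table 6 (an assumption, matched to the counts quoted in the text). Each step adds one item and forms only the combinations containing it; supersets of sets already found infrequent are skipped.

```python
from itertools import combinations

# Hypothetical transactions consistent with the example's reported counts.
transactions = [
    {"A", "B", "C", "D"},
    {"B", "C", "D", "E"},
    {"A", "B", "C", "D", "F"},
    {"A", "B", "C", "F"},
    {"A", "B", "D", "E", "F"},
]
min_supp = 3

def supp(itemset):
    return sum(1 for t in transactions if itemset <= t)

order = ["B", "A", "C", "D", "F"]   # descending support; E is below min-supp
frequent, infrequent, seen = {}, [], []
for item in order:
    for r in range(1, len(seen) + 1):
        for base in combinations(seen, r):
            cand = frozenset(base) | {item}   # only combinations with the new item
            if any(bad <= cand for bad in infrequent):
                continue                      # superset of a removed set: skip
            s = supp(cand)
            if s >= min_supp:
                frequent[cand] = s
            else:
                infrequent.append(cand)
    seen.append(item)

print(frequent[frozenset({"B", "A"})])         # 4, as in the first step above
print(frozenset({"B", "A", "C"}) in frequent)  # True: the triple from step two
```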

The Third Suggestion
We suggest that at each stage, instead of adding a single item to produce the compounds, all items that have equal support, and the highest support among the remaining items, are added together. For example, instead of adding only item A to item B, items A, C, and D are added because they all have supp = 4, and in the first step all combinations of these items are obtained. In other words, several items can be combined at each stage instead of one, provided the added items share the same support value. This reduces the number of steps in the example from 4 to 2 (Table 11).
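The grouping can be sketched in a few lines: items are sorted by support and then partitioned into tiers of equal support, so a whole tier is added per stage (the support values follow the example; E has already been removed).

```python
from itertools import groupby

supports = {"B": 5, "A": 4, "C": 4, "D": 4, "F": 3}         # example's values
ordered = sorted(supports, key=supports.get, reverse=True)  # B, A, C, D, F
tiers = [list(g) for _, g in groupby(ordered, key=supports.get)]
print(tiers)  # [['B'], ['A', 'C', 'D'], ['F']]
```

With these tiers, the first stage combines B with the whole {A, C, D} tier and the second stage adds F, matching the reduction from 4 steps to 2.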

Association Rules' Learning
As mentioned, the Apriori algorithm itself only creates the frequent patterns; after the set of important item-sets has been found, the next step is to derive the strong association rules from it. The association rules are obtained as follows:
• For each important item-set, all non-empty subsets are generated.
• For each non-empty proper subset X of an important item-set l, the rule X → (l − X) is added to the set of rules if its confidence reaches min-conf.
Since important sets have already been generated, there is no need to recalculate the support value for them and their subsets.
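The two bullet points can be sketched as follows. This illustrative rendering reuses stored support counts, as the text notes; the counts here are hypothetical values for the frequent item-set {B, C, D} and its subsets from the running example, and a rule is kept when its confidence supp(l)/supp(X) reaches min-conf.

```python
from itertools import combinations

# Stored support counts (out of 5 transactions) -- no recounting is needed.
supp = {
    frozenset("B"): 5, frozenset("C"): 4, frozenset("D"): 4,
    frozenset("BC"): 4, frozenset("BD"): 4, frozenset("CD"): 3,
    frozenset("BCD"): 3,
}

def rules_from(itemset, min_conf):
    """Emit X -> (l - X) for every non-empty proper subset X of l."""
    l = frozenset(itemset)
    out = []
    for r in range(1, len(l)):
        for x in combinations(sorted(l), r):
            x = frozenset(x)
            conf = supp[l] / supp[x]       # supports are already known
            if conf >= min_conf:
                out.append((set(x), set(l - x), round(conf, 2)))
    return out

for rule in rules_from("BCD", min_conf=0.7):
    print(rule)   # e.g. ({'C', 'D'}, {'B'}, 1.0)
```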
The time and space complexity is O(2^i) for both the Apriori and Fariori algorithms, where i is the total number of distinct items in the transactional dataset. The worst case occurs only if the support threshold is zero, or the support of every subset (set of unique items) exceeds the threshold (Fig. 2). In the first step, the Apriori algorithm needs to store C(i, 1) candidates, in the second C(i, 2), in the third C(i, 3), and so on, until the last step, where it stores C(i, i). Summing these terms gives 2^i − 1, which is O(2^i). The important difference between the Apriori and Fariori algorithms is that Fariori creates the important item-sets in the early steps and delays the combination of items with small support.
Our goal is to create association rules for the city database, and the implementation of the Fariori algorithm has enabled us to produce more rules.

Discussion
The extracted association rules are valid until the database changes: if a new transaction set is added, the support and confidence of the rules may change, so the dataset should be re-examined for the strength and robustness of its rules. In general, item-sets of the original database, and the rules derived from them, may no longer be important in the new database, and conversely, item-sets that were previously unimportant may become important. The easiest way to handle this is to recalculate the important item-sets in the new database, but that is costly. In the Fariori method, the important item-sets of every size are found in each pass over the database, and candidates are created from the set of important items of the previous step. The proposed method is implemented in MATLAB, and with a support value of 0.3 for the urban database, 73 rules are created, as shown in Fig. 3. Rule 73 is {25, 49} → {11} and states that cities with features 25 and 49 will, with probability 0.3258, also have feature 11. As shown in Fig. 4, 10 out of the 31 cities have features 25 and 49; these cities could therefore also have feature 11 with probability 0.3258. The algorithm is implemented so that it can differentiate the cities in terms of rules and make city analysis easy and convenient.
This method increases the software's ability to obtain more rules by delaying removable items. When the three features 25, 11, and 49 are identified as a frequent item-set in some cities, the corresponding rules are obtained. We ran our urban database through the "arules" package (package authors: Michael Hahsler and Bettina Gruen) in R as follows:

> data <- as(data3, "itemMatrix")
> rules <- apriori(data, parameter = list(minlen = 2, supp = 0.5, conf = 0.6, target = "rules"))
> summary(rules)

There is also an implementation of the Apriori algorithm that we call Market Basket Analysis (MBA), and we ran it on the urban database to obtain results for comparison; the corresponding code is shown in Listing 1. Another algorithm used for comparison is that of reference [33], which we call Yarpiz.
By keeping the confidence value constant at 0.6 and changing the support value from 0.1 to 0.8, the changes in the number of rules and execution time obtained from R, Weka, MBA, Yarpiz, and Fariori are shown in Table 12.
• Weka. Some of the rules reported by Weka have a support value less than min-supp, which is a problem. Weka calculates a support of 0.4 as 12 out of 31 items, although mathematically 0.4 × 31 = 12.4. The problem arises because Weka rounds the number of items, which shifts its calculations.
• Yarpiz. The Yarpiz algorithm cannot solve the following example.

Conclusion
In the case of cities, we believe that despite geographical differences, some features are similar across cities, and similar cities exist not only within one country but also globally. Finding these similar features will help urban experts in urban planning studies. Such findings support decisions on urban activities such as devising appropriate strategies, traffic management, optimization of urban energy systems, and even the correct placement of city offices, organizations, companies, etc. Today, association rule learning is used in a variety of geographic applications, and unlike sequence mining, the order of the rules is not considered for either items or transactions.
City data mining and finding urban dependencies with the Fariori algorithm identify more rules of city similarity in the urban database and give urban planners more choices. When the output conditions of the algorithm are frequent = 6, support = 0.194, feature = '4', the identified cities are Arak, Hamadan, Isfahan, Kerman, Yasuj, and Zanjan: 19.4% of the cities in our database, namely these six cities, are at similar altitudes of about 1501 to 2000 meters (feature 4). Under these conditions, one of the rules obtained is:

{4} → {54}, conf = 0.5, lift = 1.73, support = 0.1

It means that these six cities, with 50% confidence, have an average annual precipitation of 301-600 mm (feature '54'). Thus, through the interpretation of each association rule, we discover the hidden similarities of cities.
When the computer does not have the storage space for Table 5, the Fariori method fails, and this is when the number of features or the number of columns is too large. But in the application of geography and the similarity of cities, where the features are usually not more than 1000, this method is more efficient than the Apriori method.
The preparation of urban databases, along with their timely publication, is our request to geographical institutions. It would be better to have a global database of cities to support decision-making. For example, knowing which aspects of city X in Iran are similar to city Y in the Netherlands, and what decisions are currently being made in city Y, helps managers make similar decisions for city X. In this way, urban studies become more coherent and accurate. Table 5 can also serve as the basis for implementing genetic algorithms to minimize errors in the similarity of cities.
We will present this algorithm as a package in R software so that urban experts and decision-makers can use it easily.

Appendix A: geographical information
The geographical coordinates of the cities are shown in Table 13, and the Köppen climate classification (for the cities covered in this article) is shown in Table 14.

Notes to Table 14: Hot Summer (a): Thot ≥ 22. MAP = mean annual precipitation; MAT = mean annual temperature; Thot = temperature of the hottest month; Tcold = temperature of the coldest month; Psdry = precipitation of the driest month in summer; Pwdry = precipitation of the driest month in winter; Pswet = precipitation of the wettest month in summer; Pwwet = precipitation of the wettest month in winter. Pthreshold varies according to the following rules: if 70% of MAP occurs in winter, then Pthreshold = 2 × MAT; if 70% of MAP occurs in summer, then Pthreshold = 2 × MAT + 28; otherwise, Pthreshold = 2 × MAT + 14.