Using machine learning algorithms for predicting real estate values in tourism centers

Along with the development of technology in recent years, artificial intelligence (machine learning) techniques that perform operations, such as learning, classification, association, optimization, and prediction, have started to be used on data on real estate according to the criteria affecting the value. Using artificial intelligence (machine learning) techniques, valuation processes are performed objectively and scientifically. In this study, machine learning techniques were employed to balance the real estate market, affected by the tourism sector in Alanya district of Antalya province, Turkey, and examine changes in value objectively and scientifically. First, the criteria affecting the real estate value were determined as structural and spatial, and data on real estate were obtained from the online real estate website. Then, the values of the real estate in the selected application area were predicted using machine learning algorithms (k-nearest neighbors, random forest, and support vector machines). Unlike studies in the literature, algorithm-based valuation using machine learning algorithms was performed instead of mathematical modeling. When analyzed for performance metrics, the best result was achieved with the support vector machines algorithm (0.73). Objective methods should be used to balance the exorbitant differences between real estate values, to regulate market conditions and to carry out a real estate valuation process free from speculative effects in coastal areas where tourism factor is effective. This study indicated the applicability of algorithm-based machine learning techniques in real estate valuation.


Introduction
Real estate is an expression commonly used for land and structures on it (FIG 1995). Real estate valuation can be explained as determining the possible value of one or more real estate and the rights and benefits related to this real estate on the day of valuation based on objective criteria (Aclar and Cagdas 2008). Real estate valuation has existed for as long as owned assets have been valued, and reflecting the values obtained by valuation in taxes is one of the most important economic resources (Timur 2009). The interest in and importance of real estate valuation have increased along with the economic evaluation of development plans, making the real estate sector safe and open, and the monitoring of price changes occurring in the market. Real estate valuation is needed in public applications, such as expropriation, privatization, land and land arrangement, urban transformation, taxation, and registration-based transactions, and individual applications, such as banking, insurance, lending, and purchase and sale (Cete 2008; Erdem 2018).
There is no exact mathematical method to determine the value of the real estate. However, real estate valuation practices performed based on people's intuition and experience during the valuation process require objective and scientific methods (Tanaka and Shibasaki 2001; Yomralioglu et al. 2011). In this context, traditional and statistical methods are used. The values obtained by these methods differ from one another, or they differ from the market values. To prevent these differences, it is first necessary to determine the criteria affecting the real estate value and to identify a method that gives objective results. The criteria affecting the real estate value differ in terms of legal, spatial, physical, structural, and environmental features, and these features may vary from person to person in terms of quantity and quality (Eren et al. 1999). Due to the development of technology and problems in the application of traditional methods, new and alternative valuation methods in which these criteria can be evaluated objectively have begun to be sought (Ulvi 2018).
Machine learning algorithms, among the artificial intelligence technologies, have begun to be used as alternative valuation methods in the field of real estate valuation. Machine learning is the general name for computer algorithms that model the problem intended to be solved based on the data obtained from the environment of that problem (Dokuz et al. 2020). The purpose of machine learning algorithms is to obtain the highest performance from the model established with the existing dataset and the algorithm used. There are many proposed machine learning algorithms. These algorithms may differ according to their approach to the problem and may have different success in different problems (Pekel 2018). The most common machine learning algorithms are support vector machines, logistic regression, naive Bayes, k-nearest neighbors, random forest, neural network, and decision tree algorithms. We refer the readers to Rajchakit et al. (2021) for theoretical research in the context of neural networks.
Machine learning algorithms provide better performance than statistical and traditional methods for real estate valuation due to their ability to learn from data. The studies in the literature that use machine learning algorithms in real estate valuation can be grouped by algorithm as k-nearest neighbors (Pow et al. 2014; Borde et al. 2017; Moosavi 2017), random forest (Pow et al. 2014; Borde et al. 2017; Moosavi 2017; Banerjee and Dutta 2017; Ravikumar 2017), support vector machines (Nas 2011; Wang et al. 2014; Banerjee and Dutta 2017; Ravikumar 2017), artificial neural networks (Ozkan et al. 2007; Nas 2011; Yalpir et al. 2014; Abidoye and Chan 2018; Ulvi and Ozkan 2019), naive Bayes (Park and Bae 2015; Grybauskas et al. 2021), and classification and regression tree (Pai and Wang 2020).
Geographic information systems (GIS) are used to visualize real estate valuation practices with maps. It ensures the effective management of data interpretation processes by determining and analyzing the spatial criteria affecting the real estate value and producing alternatives to solve the problems encountered. In the literature, there are many studies using GIS in real estate valuation. The studies on GIS, used to create a database by associating the spatial data of real estate with attribute data (Doner and Alkan 2011), analyze the price changes and the spatial criteria affecting the real estate value (Sesli 2015;Baser et al. 2016;Yalpir et al. 2016;Yagmahan and Gulgen 2018;Ozguven and Erenoglu 2020), and map and visualize the results obtained by different real estate valuation methods (Unel and Yalpir 2019), can be shown as examples.
In this study, changes in real estate values in Mahmutlar neighborhood of Alanya district of Antalya province, a tourism center in Turkey, were examined by machine learning methods. The selected application area is a very important region in terms of determining the criteria under the effect of the sea and tourism and performing precise valuation. The difference of this study from studies in the literature is that algorithm-based valuation using machine learning algorithms was performed instead of mathematical modeling, in addition to the application area. The study first determined the structural and spatial criteria affecting the real estate value. The valuation results were obtained using these criteria and machine learning algorithms, such as support vector machines, random forest, and k-nearest neighbors algorithms. According to the results, the best performance was achieved with the support vector machines algorithm, and the real estate value maps were visualized using GIS technologies. This study showed that real estate valuation could be executed objectively by employing algorithm-based machine learning techniques in a region such as Alanya, where the market balance may change with the tourism factor.
The main contributions of this study are as follows:
• Machine learning-based real estate valuation is performed using spatial and structural features.
• Three machine learning algorithms are applied to a real dataset that contains 17 features.
• The features of the dataset cover all aspects of real estate valuation.
• GIS-based visualization of the results of the machine learning algorithms is performed.

Study area and data collection
Mahmutlar neighborhood of Alanya district in Antalya province, which is a favorite tourist city in Turkey, was selected as the study area (Fig. 1). Alanya district is 154 km from the city center and has an area of 1598.51 km². This district is surrounded by the Taurus Mountains in the north and the Mediterranean Sea in the south. Alanya district is preferred by many tourists for both vacation and real estate purchases and sales due to its being a coastal tourism region and its natural beauty. Mahmutlar neighborhood, located in this district, is a rapidly developing region where valuation needs to be analyzed due to the intensity of real estate purchases and sales. Mahmutlar neighborhood was established on the coastline and has a mountainous structure in its inner parts. In the region, the proximity of the real estate to the sea is one of the most important criteria affecting the market value of the real estate. The mountain view is also an effective criterion for the real estate since the region's inner parts are mountainous. Another criterion that is effective depending on the tourism factor is the social facilities of the real estate. Apart from these criteria affecting the market values of the real estate, the country's economic situation and foreign tourists, whose number increases periodically in the region, are also important criteria affecting the value.
In this study, considering the region's general structure, the criteria affecting the real estate value were discussed as structural and spatial (Ozcan 2019). The structural criteria were determined as the building's age, floor, usage area, number of rooms, the presence of a balcony and elevator, flat facilities and landscape criteria. The spatial criteria were determined as the distance to the sea, education areas, transportation centers, health areas, green areas, religious facilities, cemeteries, shopping centers, and sports areas. A general theoretical framework regarding the criteria is given in Table 1. The cost of the property is also an explanatory variable in the real estate valuation process. However, the property has a special structure in Turkey. In addition, the recent pandemic process and economic changes create difficulties in obtaining this data and accessing it within the property system of Turkey.
In the study, a region covering the Alanya coastline and its immediate surroundings, where there is no contradictory construction in terms of structural and functional use, architectural and landscape changes, and environmental characteristics, was determined as the application area. A sample of 200 real estate properties with accessible data on the criteria that reflect this region and affect the value of the real estate constitutes the database of the study. The data on real estate were obtained from online real estate websites in 2019. The data on the structural features of the real estate were acquired through advertisements on these websites. Table 2 summarizes the structural features of 10 real estate properties as an example. It is also possible to access data on the location of the real estate from online real estate websites. However, since Mahmutlar neighborhood is a tourism region, the locations of the real estate given by real estate brokers in advertisements generally do not reflect reality. This problem was solved with the help of Google Maps and the Alanya Urban Information System, and the real locations of the real estate were determined. The locations of the real estate were marked on the available map of the region using ArcGIS 10.6 software. A database was created by associating the locations of the real estate with the attribute data. Using this database, the distances of the real estate to the spatial criteria affecting their values (sea, education areas, transportation centers, health areas, green areas, religious facilities, cemeteries, shopping centers, and sports areas) were calculated by distance analyses (Fig. 2).
There are 5 education areas, 29 transportation centers, 2 health areas, 6 green areas, 6 religious facilities, 2 cemeteries, 44 shopping centers, and 3 sports areas in the study area.
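The distance analysis described above was carried out in ArcGIS. Purely as an illustration of the idea (the distance from each property to the nearest facility of a given type), a minimal Python sketch is given here. The coordinates are made up, the function names are hypothetical, and the haversine great-circle distance is an assumption standing in for the GIS distance tool used in the study.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_distance_m(prop, facilities):
    """Distance in metres from one property to the closest facility in a list."""
    return min(haversine_m(prop[0], prop[1], f[0], f[1]) for f in facilities)


# Hypothetical coordinates, not taken from the study's database.
property_location = (36.490, 32.100)
health_areas = [(36.495, 32.100), (36.480, 32.120)]
nearest_health_m = nearest_distance_m(property_location, health_areas)
```

Repeating this for each of the nine facility types would give one distance column per spatial criterion.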

Machine learning algorithms
This study analyzed changes in value based on the structural and spatial criteria of the real estate using machine learning algorithms. Three algorithms were selected for this study, i.e., k-nearest neighbors (kNN), random forest (RF), and support vector machines (SVM). These algorithms are generally preferred based on their superior performance over other machine learning algorithms for regression purposes. The SVM algorithm is based on a mathematical model, while the kNN and RF algorithms are based on the dataset and generate data-driven results. The Python programming language was used to implement the algorithms. The details of the algorithms are presented below.

Support vector machines algorithm
The main idea of the support vector machine introduced by Boser et al. in 1992 is based on the concept of classification with optimal distance (Boser et al. 1992). Furthermore, the support vector machine uses two basic approaches. The first is that the dimensions start to cut each other perpendicular with the increasing number of dimensions, and, thus, the distinction is made more easily. The second is that there is no need to use the whole model for distinction, and the points close to the boundary between classes are sufficient for it (Yom-Tov 2003). Figure 3 shows the structure of a typical linear support vector machine.
In Fig. 3, w is the weight vector, and the margin indicates the width of the distance between support vectors. It is aimed to perform the classification with the maximum margin and the minimum error. Furthermore, the SVM is also used in regression problems.
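For reference, the maximum-margin idea is commonly written as the optimization problem below. This is the standard textbook form for the linearly separable case, not a formulation given in the source; the training pairs $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, the bias $b$, and the sample count $n$ are assumed notation.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1,\qquad i = 1,\dots,n,
\qquad\text{with margin width}\ \frac{2}{\lVert w\rVert}.
```

Minimizing $\lVert w\rVert$ therefore maximizes the margin. In the soft-margin variant, violations of the constraints are allowed and weighted by a penalty parameter, usually denoted $C$.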
In this study, the support vector machine was addressed as a regression problem, and the penalty parameter was used as 1.
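A minimal sketch of such a setup, assuming scikit-learn was the Python library used (the source does not name it). The data are synthetic stand-ins with the same shape as the study's dataset, and the kernel is left at the library default because the source does not state it.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in data: 200 samples, 17 features scaled to [0, 1].
rng = np.random.default_rng(0)
X = rng.random((200, 17))
y = X @ rng.random(17)  # synthetic target, not real prices

# Penalty parameter C = 1 as stated in the text; the kernel (library
# default, RBF) is an assumption, since the source does not specify it.
model = SVR(C=1.0)
model.fit(X, y)
predictions = model.predict(X[:5])
```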

K-nearest neighbors algorithm
The kNN algorithm proposed by Cover and Hart in 1967 is a classification algorithm that makes an assignment by considering the class values of the k nearest neighbors of the object to be classified (Cover and Hart 1967). It is used in regression problems as well as classification. Its ease of applicability, traceability, suitability for parallel operation, and lower sensitivity to noise can be listed as the advantages of this algorithm. However, the kNN algorithm also has certain disadvantages, which are its need for a large amount of memory, the extreme variability of its performance according to the chosen k value and the similarity criteria used, and the increase in processing load with an increase in the number of attributes and the dataset size (Bhatia and Author 2010). In the kNN algorithm, similarities and distances between objects are calculated using measures such as the Minkowski, Euclidean, Manhattan, and Chebyshev distances.
In the present study, the kNN algorithm was used to solve the regression problem, and the k value was 7. The algorithm's best performance was considered while determining the k value.
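The regression variant can be sketched in a few lines of plain Python: the predicted value is the mean target of the k nearest training samples. The Euclidean distance and the unweighted mean are assumptions, since the source does not state which distance or weighting was used; the function name and the toy data are hypothetical.

```python
import math


def knn_predict(train_X, train_y, query, k=7):
    """Predict a value as the mean target of the k nearest training points,
    using the Euclidean distance."""
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Toy one-feature data: the target is 10 times the feature value.
toy_X = [[float(i)] for i in range(10)]
toy_y = [10.0 * i for i in range(10)]
estimate = knn_predict(toy_X, toy_y, [4.0], k=7)
```

With k = 7, the query at 4.0 is averaged over the samples at 1 through 7, which illustrates why isolated price peaks tend to be smoothed out by this method.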

Random forest algorithm
The RF algorithm, introduced by L. Breiman in 2001, is a prominent ensemble-based classification method. In the RF algorithm, classification is performed by creating an ensemble of trees and assigning to the object the class value that receives the most votes from the trees (Breiman 2001). Each tree in the ensemble is grown using a random vector and a random sample of individuals drawn from the training dataset. The aim is to determine the feature with the best distinguishing ability within a random feature subset. First, random samples taken from the dataset are assigned to the trees to be created. Then each tree evaluates the test data according to its training and calculates a class value. Finally, a joint decision is made, and a class value is assigned to the test object according to the most popular decision.
The RF algorithm can be used with both categorical and continuous data. In the current study, the RF algorithm was employed to solve the regression problem, and the number of trees was determined as 100.
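A minimal sketch of a 100-tree regression forest, assuming scikit-learn (the source does not name the library). The data are synthetic and `random_state` is added only for reproducibility. In regression, the majority vote described for classification is replaced by averaging the predictions of the individual trees, which the last lines check.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 200 samples, 17 features scaled to [0, 1].
rng = np.random.default_rng(0)
X = rng.random((200, 17))
y = X @ rng.random(17)  # synthetic target, not real prices

# 100 trees as stated in the text; random_state is an assumption.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# For regression, the forest output is the mean of the tree outputs.
tree_mean = np.mean([tree.predict(X[:5]) for tree in forest.estimators_], axis=0)
forest_pred = forest.predict(X[:5])
```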

Method
The machine learning method, experimental setup, dataset, and preprocessing steps used to reveal changes in value in the area selected in this study are explained.

Experimental setup and its functioning
In the study, first, the dataset was read from the database. Then the data were preprocessed. Afterward, real estate values were predicted using the kNN, SVM, and RF algorithms, and the results were obtained. The prediction process of the machine learning algorithms is based on the spatial and structural features of the dataset. Furthermore, ten-fold cross-validation was used to compare the training performance of the algorithms more efficiently. Figure 4 shows the experimental setup. As seen in the figure, first, the data are collected from the database. Then, the raw data are processed in the preprocessing stage, the details of which are explained in Sect. 2.3.3. Next, the training and test data are separated. After that, the model is created using the machine learning algorithms and tested with ten-fold cross-validation. Finally, price estimation is performed using the model with the test data.
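The ten-fold procedure can be sketched in plain Python as follows. The function names, the shuffling seed, and the mean-value baseline model are hypothetical and serve only to show how every sample is predicted by a model that never saw it during training.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(fit_predict, X, y, k=10):
    """Return out-of-fold predictions: each sample is predicted by a model
    trained on the other k - 1 folds.
    fit_predict(train_X, train_y, test_X) must return a list of predictions."""
    predictions = [None] * len(X)
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        fold_pred = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        for i, p in zip(test_idx, fold_pred):
            predictions[i] = p
    return predictions


def mean_baseline(train_X, train_y, test_X):
    """Trivial stand-in model: always predict the mean training target."""
    mean = sum(train_y) / len(train_y)
    return [mean] * len(test_X)


folds = k_fold_indices(200)  # 200 samples, as in the study's dataset
baseline_pred = cross_validate(mean_baseline, [[0.0]] * 200, [1.0] * 200)
```

With 200 samples and ten folds, each model is trained on 180 samples and tested on the remaining 20.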

Dataset
The dataset was designed from the structural and spatial data of the real estate in the region selected as the application area. The structural data were the numerical and nominal data expressing eight structural criteria (the building's age, floor, usage area, number of rooms, balcony, elevator, flat facilities, and landscape) in Table 2. The spatial data were the numerical and nominal data expressing the spatial distances of the real estate to the nine spatial criteria affecting their values (sea, education areas, transportation centers, health areas, green areas, religious facilities, cemeteries, shopping centers, and sports areas). The dataset was categorical and consisted of 200 samples and 17 features.

Preprocessing step
In order to achieve better results with machine learning algorithms, the dataset needs to be preprocessed. The preprocessing stage consists of creating the dataset, removing outliers, completing the missing information, editing the data, and scaling, that is, normalizing the data. A homogeneously distributed dataset was created to represent the study area. Since the data obtained from online real estate sites are data entered by real estate agents, there are missing values, information in various formats and incorrect data. Therefore, outliers were not included in the dataset. The dataset is associated with GIS in order to produce real estate value maps and to perform spatial analysis of real estates. Deficiencies in spatial information on online real estate sites were completed using Google Maps and Alanya Urban Information System. At the stage of editing the data, the statements expressed in different forms were arranged to show the same nominal value. Then, the attributes, such as the age range of the buildings were converted into mean values. Thus, a dataset containing nominal and numerical data was created. The final step is scaling the data. At this stage, the normalization process was applied to eliminate the numerical differences between the features and convert them to the same range. The data set values were changed to the range of 0-1 and it was ensured that they affect the results equally.
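Two of these steps, converting a range to its mean value and rescaling a feature to the range 0-1, can be sketched as follows. The range format ("5-10") and the function names are assumptions made for illustration; the source does not show how the ranges were encoded.

```python
def range_to_mean(age_range):
    """Convert a range string such as '5-10' to its midpoint (7.5)."""
    low, high = (float(part) for part in age_range.split("-"))
    return (low + high) / 2


def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical building-age ranges as they might appear in advertisements.
ages = [range_to_mean(r) for r in ["0-4", "5-10", "11-15"]]
scaled_ages = min_max_scale(ages)
```

Applying the same rescaling to every feature keeps a distance measured in metres from dominating a feature such as the number of rooms.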

Performance metrics
The correlation coefficient (R), root mean square error (RMSE), and mean absolute error (MAE) performance metrics were used while comparing the algorithm results (Willmott 1982; Demolli et al. 2019). R shows the relationship between the variables. It is in the range of -1 to +1. As the relationship between the variables approaches a positive linear relationship, the R value approaches +1. If the relationship between the variables becomes negative, the R value approaches -1. RMSE is a quadratic metric that evaluates the magnitude of a machine learning model's error. RMSE is calculated as the square root of the mean of the squared differences between the actual and predicted values. MAE is calculated as the mean of the absolute differences between the predicted and actual values. These performance metrics are widely used and preferred to evaluate the performance of machine learning algorithms.
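As a worked illustration, the three metrics can be computed in plain Python as follows. The function names and the four sample values are hypothetical; R is computed here as the Pearson correlation between the actual and predicted values, which is an assumption about the definition used in the study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def correlation(actual, predicted):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical normalized values, not results from the study.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 4.0, 4.0]
scores = (correlation(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```

Because RMSE squares the errors before averaging, a few badly mispredicted peak prices raise it more than they raise MAE.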
The formulae of the metrics used are presented below:

\[ R = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \]

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \]

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $\bar{y}$ and $\bar{\hat{y}}$ are their respective means, and $n$ is the number of samples.

Results and discussion
The values of the real estate in the selected application area were predicted using the kNN, RF, and SVM machine learning algorithms presented under the Method heading. The correlation between the real estate values predicted by the algorithms and the actual values was evaluated in terms of performance metrics (Table 3). The distribution of the predicted values regarding spatial change was mapped (Figs. 5, 6 and 7). According to all these figures and results, the most successful result was achieved by the SVM algorithm. Upon examining the results of the kNN and RF algorithms, it can be said that the RF algorithm found the correlation between the attributes better. However, in the prediction values, the kNN algorithm made better predictions on general average than the RF algorithm. Moreover, it is possible to say that the algorithms had difficulty predicting the peak points. Even the SVM algorithm, which gave the best result, could not predict many jumps. Figure 8 presents the prediction performance of the machine learning algorithms on 20 randomly selected records from the dataset. As can be seen in the figure, the prediction values of the SVM algorithm are closer to the real data compared to the RF and kNN algorithms. However, the price values peak very high at some points, and the algorithms cannot predict these prices correctly, because of the abnormal prices set by some sellers.
Normalization was applied to the values obtained by real estate valuation practices performed using the SVM, kNN, and RF machine learning algorithms. The spatial change and distribution of the normalized real estate values were visualized with maps. As seen in the maps obtained by the SVM, kNN, and RF algorithms, the real estate within a value range of 0.45-1.00 was located in the coastal regions of the study area. In general, it was revealed that the real estate values varied in the 0.05-0.45 value range and the real estate within this value range was located in the inner parts.

Fig. 5 SVM results
When the spatial distribution of the real estate values obtained by the SVM, kNN, and RF algorithms was investigated, it was observed that the values obtained by the kNN and RF applications were close to each other. While the real estate in the 0.65-1.00 value range was more common in the map obtained with the SVM algorithm, the real estate in the 0.05-0.25 value range was more common in the map obtained with the kNN and RF algorithms.

Conclusion
The most important step in real estate valuation practices is the objective evaluation of the criteria affecting the real estate value with the help of a system. In this study, the criteria affecting the real estate value were determined by reviewing the literature and examining the study area's conditions. The structural criteria related to real estate were obtained from an online real estate website, and the spatial criteria were obtained by distance analysis using GIS technology. It was aimed to obtain precise results by evaluating the criteria determined in the study within a certain system. Thus, artificial intelligence (machine learning) techniques were employed instead of the classical methods used in real estate valuation. For this purpose, first, the structural and spatial parameters determining the value of real estate were identified, and data on real estate were gathered from an online real estate website. Then, using machine learning techniques (kNN, RF, and SVM), the values of real estate in the selected application region were predicted. In contrast to previous research, an algorithm-based valuation employing machine learning techniques was carried out rather than mathematical modeling.
In this study conducted in Alanya city, one of the most important tourism regions in Turkey, real estate valuation was performed using machine learning algorithms. In the current research conducted on 200 samples using the SVM, kNN, and RF algorithms, the best result was achieved with the SVM algorithm (0.73). The spatial distribution of the values obtained by real estate valuation with machine learning algorithms was examined on the maps generated using GIS technology.
The present study was carried out using machine learning algorithms and GIS to prevent unfavorable market conditions, especially in tourism regions, and to perform real estate valuation objectively and scientifically. Determining the criteria affecting the value of real estate in coastal areas is a difficult process, and in this study, the criteria were determined in accordance with the literature. In addition, machine learning algorithms, which are among the artificial intelligence techniques, were used to regulate the rent, to prevent negative market conditions, and to determine the real estate value in a fair and objective way in the regions where the tourism factor is effective. This study contributes to the literature by using an algorithm rather than the modeling frequently encountered in real estate valuation. It is a guide for future studies in terms of performing real estate valuation practices on the neighborhood or regional scale and the use of artificial intelligence techniques.

Fig. 7 RF results
Authors contributions Tansu ALKAN, Aslı BOZDAG, and S. Savaş DURDURAN performed data collecting, visualization, and writing. Yeşim DOKUZ and Alper ECEMIŞ contributed to methodology and writing.
Funding Not applicable.
Availability of data and material Not applicable.
Code availability Software application.

Declarations
Conflict of interests Not applicable.