Deep Neural Network
As described above, Rural Image Clap generates a large volume of rural housing images, part of which carry manual housing quality assessments from users. It therefore offers a valuable opportunity to use deep learning to predict housing quality from these images automatically and at scale across rural China, and in turn to reveal the status of rural wealth and development in China.
In brief, deep learning extracts high-dimensional features from input images with a deep neural network and maps those features to housing quality through a fully connected layer. Prediction accuracy therefore depends on how effectively features are extracted from the images. Among deep learning models such as AlexNet27, VGG28, and ResNet29, DenseNet30 has been strikingly successful in image processing tasks owing to its strong feature extraction ability. DenseNet is therefore adopted for rural housing quality prediction in this paper.
The architecture of DenseNet is shown in Fig. 6. It consists, in sequence, of a single convolutional layer, four Dense Blocks, three Transition layers, and one fully connected layer. As Fig. 6 shows, each Dense Block comprises several modules, each with two convolutional kernels of different sizes. These modules are linked by “Dense Connections,” which make full use of shallow convolutional features to enhance model performance. In addition, each Transition layer connects adjacent Dense Blocks and passes the extracted features on30. Finally, the successively convolved high-dimensional features are fed to the fully connected layer to predict housing quality.
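The layer sequence described above can be traced with a short channel-counting sketch. The text does not state which DenseNet variant is used, so the block sizes (6, 12, 24, 16), growth rate of 32, 64 initial channels, and transition compression of 0.5 are assumptions taken from the standard DenseNet-121 configuration, and the function name `densenet_channel_trace` is hypothetical.

```python
def densenet_channel_trace(block_layers=(6, 12, 24, 16), growth_rate=32,
                           init_channels=64, compression=0.5):
    """Trace channel counts through Dense Blocks and Transition layers.

    Defaults follow the standard DenseNet-121 configuration (an assumption;
    the paper does not name the variant it uses).
    """
    channels = init_channels  # output of the single initial convolutional layer
    trace = [("conv", 0, channels)]
    for idx, n_layers in enumerate(block_layers, start=1):
        # Dense connection: each module's output (growth_rate channels) is
        # concatenated onto everything before it, so channels grow linearly.
        channels += n_layers * growth_rate
        trace.append(("dense_block", idx, channels))
        # A Transition layer sits only between adjacent Dense Blocks.
        if idx < len(block_layers):
            channels = int(channels * compression)
            trace.append(("transition", idx, channels))
    return trace


for kind, idx, channels in densenet_channel_trace():
    print(f"{kind:12s} {idx}  ->  {channels} channels")
```

Under these assumed settings, the trace contains four Dense Blocks and three Transition layers, and the features entering the fully connected layer have as many channels as the last Dense Block outputs.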
For a deep neural network, prediction accuracy is determined by the model's parameters. To obtain optimal parameter values, the assessed house images and their quality scores are used to train the model, updating the parameters by back propagation until the loss function converges. In this study, we use the Mean Square Error (MSE) loss to measure the deviation between predicted and true housing quality in each iteration, as formulated in Eq. 1.
$$MSE=\frac{\sum_{j=1}^{10}\frac{\sum_{i=1}^{n}\left(\widehat{y}_{i}^{j}-y_{i}^{j}\right)^{2}}{n}}{10} \tag{1}$$
Here, because each house image receives multiple quality scores from 1 to 10, \(\widehat{y}_{i}^{j}\) and \(y_{i}^{j}\) are, respectively, the predicted and true normalized frequency of score \(j\) (\(j\in[1,10]\)) for image \(i\) (\(i\in[1,n]\)).
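As a minimal sketch of Eq. 1, the loss can be computed in plain Python by averaging the squared error over the images for each score and then averaging over the ten scores. The function name `mse_loss` and the toy inputs are hypothetical; in actual training this quantity would be computed by a deep learning framework so that gradients are available.

```python
def mse_loss(pred, true):
    """Eq. 1: squared error averaged over n images, then over the scores.

    pred, true: lists of n rows, each row holding the normalized
    frequencies of scores 1..10 for one image.
    """
    n = len(pred)
    n_scores = len(pred[0])  # 10 in the paper
    total = 0.0
    for j in range(n_scores):
        # Inner sum of Eq. 1: mean squared error of score j over all images.
        total += sum((pred[i][j] - true[i][j]) ** 2 for i in range(n)) / n
    # Outer sum of Eq. 1: average over the scores.
    return total / n_scores


# Toy example (hypothetical values): one image, all mass on the wrong score.
true = [[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]]
pred = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
print(mse_loss(pred, true))
```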
After training, the performance of the trained DenseNet is evaluated by \({MSE}_{avg}\) and \({R}^{2}\), formulated in Eqs. 2 to 4.
$$\widehat{y}_{i}^{avg}=\frac{\sum_{j=1}^{10}\widehat{y}_{i}^{j}\times j}{10} \tag{2}$$
$${MSE}_{avg}=\frac{\sum_{i=1}^{n}\left(\widehat{y}_{i}^{avg}-y_{i}^{avg}\right)^{2}}{n} \tag{3}$$
$${R}^{2}=1-\frac{\sum_{i=1}^{n}\left(\widehat{y}_{i}^{avg}-y_{i}^{avg}\right)^{2}}{\sum_{i=1}^{n}\left(\bar{y}^{avg}-y_{i}^{avg}\right)^{2}} \tag{4}$$
Here, \(\widehat{y}_{i}^{avg}\) and \(y_{i}^{avg}\) are the predicted and true weighted average quality scores of image \(i\), respectively, and \(\bar{y}^{avg}\) is the mean of \(y_{i}^{avg}\) over all images.
Specifically, \({R}^{2}\) measures how well the predicted scores fit the true scores: a value of 1 indicates a perfect fit, and values close to 1 indicate a model that is reliable for prediction. \({MSE}_{avg}\) measures the mean squared deviation between the predicted and true weighted average scores.
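The evaluation metrics in Eqs. 2 to 4 can be sketched in plain Python as follows. The function names and toy values are hypothetical. Eq. 2 is implemented exactly as written, including the division by 10, so when the frequencies of an image sum to 1 the weighted average lies between 0.1 and 1.

```python
def weighted_average(freqs):
    """Eq. 2 as written: sum of frequency * score over scores 1..10, over 10."""
    return sum(f * j for j, f in enumerate(freqs, start=1)) / 10


def mse_avg(pred_avg, true_avg):
    """Eq. 3: mean squared deviation between predicted and true averages."""
    n = len(true_avg)
    return sum((p - t) ** 2 for p, t in zip(pred_avg, true_avg)) / n


def r_squared(pred_avg, true_avg):
    """Eq. 4: one minus residual sum of squares over total sum of squares."""
    mean_true = sum(true_avg) / len(true_avg)
    ss_res = sum((p - t) ** 2 for p, t in zip(pred_avg, true_avg))
    ss_tot = sum((mean_true - t) ** 2 for t in true_avg)
    return 1 - ss_res / ss_tot


# Toy example (hypothetical values): true and predicted averages of 3 images.
true_avg = [0.2, 0.4, 0.6]
pred_avg = [0.25, 0.4, 0.55]
print(mse_avg(pred_avg, true_avg), r_squared(pred_avg, true_avg))
```

A model that predicts every image perfectly gives \({R}^{2}=1\) and \({MSE}_{avg}=0\), while a model that always predicts the mean true score gives \({R}^{2}=0\).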
Experimental data and settings
Based on Rural Image Clap, we collect 15,700 rural house images, after filtering, from all shared rural images covering 28 provinces in China. Each of these house images is assessed manually by at least 15 users. Specifically, (1) users subjectively give each house image a quality score from 1 (the worst) to 10 (the best); (2) all scores of each image are then converted into normalized frequencies. In addition, 50,000 raw rural house images covering 10,000 villages are collected without manual assessment.
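The conversion from user scores to normalized frequencies can be sketched as below. This is an illustration under the assumption that the normalized frequency of score \(j\) is the share of an image's ratings equal to \(j\); the function name `normalized_frequencies` and the example ratings are hypothetical.

```python
from collections import Counter


def normalized_frequencies(scores, n_scores=10):
    """Turn one image's integer ratings (1..n_scores) into frequencies summing to 1."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 1 or s > n_scores for s in scores):
        raise ValueError("scores must lie between 1 and n_scores")
    counts = Counter(scores)
    total = len(scores)
    # Position j-1 of the result holds the share of ratings equal to j.
    return [counts.get(j, 0) / total for j in range(1, n_scores + 1)]


# Hypothetical ratings of one image by 15 users.
ratings = [5, 6, 6, 7, 5, 6, 4, 6, 7, 5, 6, 6, 8, 5, 6]
print(normalized_frequencies(ratings))
```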
The average score across all 15,700 assessed rural house images is 5.7, while the highest and lowest scores are 8.7 and 3.2, respectively. Overall, the housing quality scores follow a normal distribution with a standard deviation of 0.85. Fig. 7 shows nine rural house images as typical examples: houses with multiple floors, elaborate decoration, and walled gardens are assessed as high quality. In contrast, low-quality houses appear primitive, fragile, and unsafe, with little ability to withstand storms; they also rarely have outdoor leisure space such as a garden and instead face the road directly.
For model training, these 15,700 assessed images are divided into training, validation, and test sets with 80%, 20%, and 10%, respectively. In addition, the hyper-parameters of the DenseNet model are set as follows: the batch size is 32, the number of epochs is 100, and the learning rate is initially \(1\times 10^{-5}\) and is adaptively reduced by a factor of 0.1.
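One way to realize an adaptively reduced learning rate is to multiply it by 0.1 whenever the validation loss stops improving. The sketch below illustrates that idea in plain Python. The text gives only the initial rate and the reduction factor, so the plateau trigger, the `patience` value, the class name `PlateauLR`, and the simulated losses are all assumptions.

```python
BATCH_SIZE = 32   # from the text
EPOCHS = 100      # from the text


class PlateauLR:
    """Reduce the learning rate by `factor` after `patience` stalled epochs."""

    def __init__(self, lr=1e-5, factor=0.1, patience=5):
        self.lr = lr              # initial learning rate (from the text)
        self.factor = factor      # reduction factor (from the text)
        self.patience = patience  # assumed; not stated in the text
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Simulated validation losses (hypothetical): improve for 10 epochs, then stall.
scheduler = PlateauLR()
for epoch in range(EPOCHS):
    val_loss = max(1.0 - 0.05 * epoch, 0.5)
    lr = scheduler.step(val_loss)
print(lr)
```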