Research on an Improved Neural Network Model for Film Text Image Segmentation in Film Internet of Things

In order to solve the problem that Film text is difficult to recognize and handle in the Film Internet of Things, this paper seeks a method that can effectively identify the content of Film text. It uses the Mask R-CNN algorithm with ResNet101 as the backbone network to establish a Film document image segmentation model. The optimal hyperparameters are: the shape ratio of the anchor frame is [0.5, 1, 3], the threshold for non-maximum suppression is 0.15, and the confidence level is 0.85. The F1 score obtained at this time is 0.8951. When these hyperparameters are substituted into an IOU of 0.8, the F1 score is 0.7417. According to the results of the Pattern Recognition Laboratory of the Chinese Academy of Sciences, this algorithm model ranked first at an IOU of 0.6. At an IOU of 0.8 it ranked second, and the first-ranked model is a non-end-to-end, single-task model. It can be seen that the adjustment of the hyperparameters and the training of the algorithm model are relatively successful. The experimental results show that Mask R-CNN can accurately identify all the formulas in the Film Text, and that it is significantly better at identifying small objects such as formulas in Film Text images than the traditional Fast R-CNN and Faster R-CNN.


Introduction
As an effective carrier of information, Film Texts play a very important role in our daily life. In general, we refer to everything that is readable on a computer or on paper and contains text, images, formulas, charts, etc. as a Film Text. In order to use and manage this Film Text information in a concise and efficient manner, scientists have conducted a great deal of research on Film Text processing methods since the 1960s. For example, these Film Texts can be processed into images using a scanner, camera or mobile phone, imported into a computer, and organized into systematic Film Text images that can be stored, managed and transmitted more efficiently and conveniently. There are also many new problems to be solved when processing Film Text images taken with a digital camera: how to extract the meaning of the text, how to extract the images, formulas and chart information, and how to judge the structure of a text image. In response to these problems, most current traditional methods can only segment and recognize plain-text images, but this covers only a part of the vast amount of information.
The traditional method roughly estimates the edge region of the text using the Sobel operator, expands the region using a morphological operation such as dilation, and then classifies the separated single characters to obtain the final text information. Representative target detection algorithms are the YOLO [1] and SSD [2] series of one-stage methods and the RCNN [3] series of two-stage methods; the former are biased toward operation speed and the latter toward higher accuracy. Currently, the use of neural network algorithms for text image processing is flourishing [4][5][6][7][8][9][10][11].

The application scenario of text image segmentation often does not require very high real-time performance, so the more mature MASK R-CNN algorithm [19][20][21] in the RCNN series is used for image segmentation. It has the advantage of high accuracy and can simultaneously perform four tasks: instance segmentation, semantic segmentation, target detection, and image classification [22][23].
Method

MASK R-CNN algorithm architecture

MASK R-CNN is an algorithm that can perform instance segmentation. By adding different branches it can achieve various complex tasks such as target detection, semantic segmentation, instance segmentation and human pose estimation. It is flexible and powerful, and is a representative algorithm at the current stage. The abstract process architecture of the algorithm is shown in Fig. 1. First, the preprocessed image data is input to the network and enters the FPN (pyramid feature extraction network) with ResNet101 as the backbone, thereby obtaining the feature map after feature extraction. Then a sliding window is applied at each point of the extracted feature map to generate candidate anchor points. These anchor points are combined with the feature map information and fed into the regional recommendation network to obtain scores and correction information for each anchor. Since the predicted coordinates are floating-point numbers and are not aligned with the feature map, the ROI Align layer is used for adjustment. Finally, different branches are connected to the recommended areas, and MASK generation, regression of frame coordinates, and generation of categories are performed. In particular, the introduction of the MASK branch allows the network to perform instance segmentation. Another noteworthy point is that the network abandons the original softmax loss function and adopts a sigmoid loss function, which to some extent avoids competition between categories and puts more focus on optimizing the MASK's pixel results.
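The per-class sigmoid loss mentioned above can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the mask shapes and helper names are made up for illustration, but it shows why a per-pixel sigmoid avoids inter-class competition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_bce_loss(logits, target):
    """Per-pixel binary cross-entropy on ONE class's mask.

    logits : raw mask predictions for a single class, shape (H, W)
    target : binary ground-truth mask, shape (H, W)

    Because each class mask is scored independently with a sigmoid,
    classes do not compete for the same pixel (unlike a softmax over classes).
    """
    p = sigmoid(logits)
    eps = 1e-7  # numerical floor to keep log() finite
    return float(np.mean(-(target * np.log(p + eps)
                           + (1 - target) * np.log(1 - p + eps))))

# A confident, correct prediction yields a small loss; an inverted one is large.
target = np.array([[1, 0], [0, 1]], dtype=float)
good_logits = np.where(target == 1, 8.0, -8.0)
assert mask_bce_loss(good_logits, target) < 0.01
assert mask_bce_loss(-good_logits, target) > 1.0
```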

Pyramid feature extraction network
The pyramid network is a feature extraction method that improves accuracy by extracting multi-scale feature information, especially in the detection of small objects. In essence it is a component built on ResNet or another common backbone feature extraction network, and it can be combined with different classic networks to improve them.
The pyramid feature extraction network structure can be roughly divided into three parts: a bottom-up deep convolutional neural network, a top-down upsampling process, and lateral connections between feature maps for information fusion. The detailed process structure is shown in Fig. 2. The left side of the figure is the forward process of the neural network, i.e., the bottom-up pathway. In the forward process the feature map changes size after passing through certain layers, so the layers over which the feature size does not change are grouped into one stage. The feature map size changes five times in total, so there are five stages, and the features output by each stage correspond to {C1, C2, C3, C4, C5}, forming a feature pyramid. In general, the residual feature extraction structure of ResNet101 [12] or ResNet50 [12] is used as the convolutional neural network for the five stages.

The top-down process uses upsampling. Its principle is interpolation: new points are inserted between existing pixels in order to magnify the feature map, so each upsampled feature map has the same size as the feature map of the previous stage.

The role of the lateral connection is to fuse the same-level feature maps of the bottom-up and top-down pathways. First, a 1*1 convolution is performed on the feature map from the bottom-up pathway; its purpose is to adjust the number of channels so that the map matches the upsampled feature map while keeping the original feature information. Then the two same-level feature maps are merged by simple element-wise addition, and the merged feature map carries richer information. In this process a total of four merged feature maps are generated, and each is then subjected to a 3*3 convolution to eliminate the aliasing effect of upsampling, yielding the new feature maps {P2, P3, P4, P5}. In order to capture coarser global information, P5 is further downsampled to generate a new feature map P6. The feature pyramid network thus completes the extraction of feature maps at different scales: {P2, P3, P4, P5, P6}.
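The lateral-connection fusion can be sketched as a toy numpy illustration. The feature-map sizes are made up, random weight matrices stand in for the learned 1*1 convolutions, and the final 3*3 anti-aliasing convolution is omitted for brevity.

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a (H, W, C) feature map."""
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

def project_channels(fmap, w):
    """A 1*1 convolution is a per-pixel linear map over channels."""
    return fmap @ w  # (H, W, Cin) @ (Cin, Cout) -> (H, W, Cout)

# Hypothetical backbone stages: C4 (4x4, 8 ch) and C5 (2x2, 16 ch).
rng = np.random.default_rng(0)
c4 = rng.standard_normal((4, 4, 8))
c5 = rng.standard_normal((2, 2, 16))

p5 = project_channels(c5, rng.standard_normal((16, 4)))   # channel adjust -> 4 ch
lateral4 = project_channels(c4, rng.standard_normal((8, 4)))
p4 = lateral4 + upsample2x(p5)  # element-wise fusion of same-level maps

assert p5.shape == (2, 2, 4)
assert p4.shape == (4, 4, 4)
```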

Data preprocessing
For the task of Film Text image segmentation, this paper uses the ICDAR2017 dataset, a batch of text image data used to segment instances of the different image types that appear in text. There are four types of data, among them formulas, charts, and background. The labeling effect is roughly as shown in Fig. 3.
There are two files in the ICDAR2017 dataset: folders storing the original images and the MASK images. There is no file storing the coordinate information of the borders, so the border coordinates must be generated manually. First, suitable thresholds are selected and each image is binarized to 0/255 to obtain the different types of MASK images. After that, if it is a background MASK, an external rectangle is directly generated as the calibration frame; otherwise, the contour detection algorithm [13] is first applied to the MASK image to generate a set of coordinate points for the instance target contour, and a circumscribed rectangle, i.e., the coordinates of the instance calibration frame, is then generated from this set of points. Finally, the MASKs of all instances are binarized to 0/1, completing the preliminary preprocessing of the data.
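For a filled binary mask, the circumscribed rectangle can also be obtained directly from the nonzero pixels, without a contour detector. The sketch below is an illustrative simplification of that step (the function name is made up), not the paper's preprocessing code.

```python
import numpy as np

def mask_to_bbox(mask):
    """Circumscribed rectangle (y1, x1, y2, x2) of a binary 0/1 mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # empty mask: no box to generate
    return int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())

mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:4, 1:5] = 1  # a small rectangular instance
assert mask_to_bbox(mask) == (2, 1, 3, 4)
assert mask_to_bbox(np.zeros((3, 3), dtype=np.uint8)) is None
```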
Second, the data needs to be processed into the form required by the algorithm model. The MASK RCNN algorithm model requires the following inputs: the original image, the MASKs, the category code list, and the calibration boxes.
First, a dimensioning process is performed on the binary MASKs of the different instances in an image: a dimension is added at the end of each MASK and the MASKs are then merged along this last dimension, resulting in a MASK matrix one dimension higher than the original, in which each pixel is a unique one-hot encoding. The code list of the categories only needs to correspond to the order of the categories encoded in the MASK. The calibration boxes are likewise merged, in the same instance order, into an ordered matrix list.
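The merging described above can be sketched in numpy; the masks, category codes and boxes here are hypothetical toy values chosen only to show the shapes involved.

```python
import numpy as np

# Two hypothetical 3x3 binary instance masks.
m_formula = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 0]], dtype=np.uint8)
m_chart   = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 0]], dtype=np.uint8)

# Add a trailing dimension to each mask and merge along it:
# the result is (H, W, num_instances), one channel per instance.
masks = np.stack([m_formula, m_chart], axis=-1)
class_ids = np.array([1, 2])  # category codes in the same channel order
boxes = np.array([[0, 0, 1, 0],   # (y1, x1, y2, x2) of each instance,
                  [0, 2, 1, 2]])  # merged in the same instance order

assert masks.shape == (3, 3, 2)
assert masks[0, 0, 0] == 1 and masks[0, 2, 1] == 1
assert class_ids.shape[0] == masks.shape[-1] == boxes.shape[0]
```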

Model framework and environment construction
This paper chooses TensorFlow as the framework for MASK RCNN. It combines a lower-level symbolic computing library with higher-level network specification libraries and is developed and maintained by Google. It has a huge community and therefore stable support and performance. In addition, it provides TensorBoard, a powerful network model visualization tool. It supports GPU parallel training and development in multiple languages such as Python and Java. Most importantly, it builds static graphs before computing, so many performance-critical calculations are compiled and executed in C++, which speeds up computation. Secondly, Anaconda is used to build and manage the Python environment and its numerous third-party packages. Because Python has a very active and huge community, many extremely useful packages are developed by third parties, and managing the versions of these packages by hand is an extremely cumbersome process. Using Anaconda to build the development environment automatically resolves the relationships between packages and installs third-party packages, greatly simplifying the process of setting up the environment. The development packages mainly required in this project are shown in Table 1.

Setting hyperparameters
Most deep learning models use an end-to-end network training mode accompanied by many hyperparameters that must be adjusted manually. How to set and adjust these hyperparameters is critical to the structure and performance of a model, so it is necessary to design a reasonable model training strategy for adjusting them. For the MASK RCNN algorithm, this paper takes the structure and naming of the hyperparameters in the code as an example and lists the most important hyperparameters and their values in Table 2.

Select Optimizer
During training, for faster processing of the training data and better non-convex optimization of the parameters, this paper chooses Stochastic Gradient Descent (SGD) as the optimizer; a brief introduction to its principle follows.
Because stochastic gradient descent updates iteratively on each sample, it needs only a small amount of data per iteration to move toward a relatively optimal solution even when the sample size is large. Its convergence rate is faster than batch gradient descent, and because the direction of descent changes constantly, it is relatively easy to jump out of local optima. The update rule of stochastic gradient descent is as follows: since the loss function L is to be minimized, each parameter θ is updated in the negative direction of its gradient on a single sample (x_i, y_i), i.e., θ ← θ − η ∇_θ L(θ; x_i, y_i), where η is the learning rate. This completes one gradient update. While stochastic gradient descent is faster, it sacrifices some precision, is more susceptible to noise, and each individual update is not necessarily an improvement.
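As a toy illustration of the SGD update, the sketch below minimizes a one-dimensional quadratic loss; it is not the paper's training code, and the learning rate and loss are made up for demonstration.

```python
def sgd_step(theta, grad, lr=0.1):
    """One stochastic gradient descent update: theta <- theta - lr * grad."""
    return theta - lr * grad

# Minimise L(theta) = (theta - 3)^2; its gradient is 2 * (theta - 3).
theta = 0.0
for _ in range(200):
    grad = 2.0 * (theta - 3.0)
    theta = sgd_step(theta, grad)

# Each step shrinks the error toward the minimum at theta = 3.
assert abs(theta - 3.0) < 1e-6
```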
Overfitting and underfitting are unavoidable problems during model training. Underfitting means the model has not learned enough knowledge, while overfitting causes the model to form a bias that worsens its generalization performance. Therefore, it is especially important to set an appropriate training strategy to control the underfitting and overfitting of the model.

Model training
First, the dataset is divided. There are about 2400 samples in the ICDAR2017 dataset, which are split into three sets at a ratio of 2:1:1: a training set of about 1200 samples, a validation set of about 600 samples and a test set of about 600 samples, with no intersection between the three. The training set is used to compute the parameters of the neural network via gradient descent; that is, the model draws its knowledge from the training set, and the distribution of the training data has the greatest impact on the model. The validation set is used to verify the performance and rationality of the model, and thereby to manually adjust the hyperparameters of the training process, such as the number of model iterations and the non-maximum suppression threshold. Because the hyperparameters are adjusted based on the validation set, the model also learns a portion of knowledge from the validation data, so the distribution of the validation data has a minor impact on the model. The test set is used only after all steps of model training are complete, to make an objective evaluation of the final model. Because it is completely isolated from the validation and training sets, the model has no prior bias toward it, which makes it possible to assess the performance of the model objectively and fairly, especially its generalization performance.
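The 2:1:1 split can be sketched as a generic shuffle-and-slice; this is an illustrative sketch, not the paper's exact partitioning code.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle and split into train/val/test at a 2:1:1 ratio, no overlap."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    train = items[: n // 2]
    val = items[n // 2 : 3 * n // 4]
    test = items[3 * n // 4 :]
    return train, val, test

train, val, test = split_dataset(range(2400))
assert (len(train), len(val), len(test)) == (1200, 600, 600)
assert not (set(train) & set(val)) and not (set(val) & set(test))
```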

Model Evaluation
In the field of target detection, the mean Average Precision metric extends accuracy and recall. In the task of target detection, the model must evaluate both the classification and the location of objects; each image contains targets of different types and locations, so the metrics used in ordinary image classification cannot be applied directly. The evaluation method is therefore based on Intersection over Union (IOU). The concept is shown in Fig. 5.
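Intersection over Union for two axis-aligned boxes can be computed as below; the (y1, x1, y2, x2) box convention is an assumption made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (y1, x1, y2, x2) boxes."""
    y1 = max(box_a[0], box_b[0])
    x1 = max(box_a[1], box_b[1])
    y2 = min(box_a[2], box_b[2])
    x2 = min(box_a[3], box_b[3])
    inter = max(0, y2 - y1) * max(0, x2 - x1)  # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

assert iou((0, 0, 10, 10), (0, 0, 10, 10)) == 1.0   # identical boxes
assert iou((0, 0, 10, 10), (20, 20, 30, 30)) == 0.0  # disjoint boxes
assert abs(iou((0, 0, 10, 10), (0, 5, 10, 15)) - 1/3) < 1e-9  # half overlap
```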

Experimental process
To evaluate the performance of a model, its evaluation index is only one aspect. Going deeper into the model to observe and analyze the performance of each stage reflects the performance of the model more clearly and in more detail. The MASK RCNN algorithm mainly includes a pyramid feature network, a regional recommendation network, coordinate fine-tuning, and a head network. The feature pyramid network is a feature extraction network with a clear and simple structure, while the regional recommendation network plays a very important role in the generation of calibration frames.
Therefore, the analysis mainly covers the regional recommendation network, the coordinate fine-tuning and the head network.
First, an image is selected and a prediction is made for it. The original image is shown in Fig. 6.
The Regional Recommendation Network (RPN) runs a lightweight binary classifier on many boxes (anchors) over the image and returns scores indicating the presence or absence of objects. Usually, even a positive anchor does not completely cover an object. Therefore, the RPN also regresses a refinement (an increment of position and size) to be applied to the anchor, shifting and adjusting it slightly toward the correct boundary of the object. The comparison of the refined effects is shown in Fig. 7.
Next, we enter the interior of the regional recommendation network, run each operational graph and dissect its predictions. Its main computation nodes are the RPN network output, pre_nms_anchors and refined_anchors in the ROI, and refined_anchors_clipped. The RPN output mainly predicts whether any class is present in each anchor box and outputs the positive anchor frames. Taking the top-scoring anchor frames as an example, the output is shown in Fig. 8.
It can be seen that the generated anchor frames have a certain calibration effect but are still relatively messy. Next comes the coordinate fine-tuning phase: first the anchor frames that have not undergone non-maximum suppression are generated together with their deviation from the modified anchor frames, as shown in Fig. 9(a). At the same time, the effect after clipping, which prevents frames from crossing the image boundary, is shown in Fig. 9(b).
As can be seen from Fig. 9(a), the modified anchor frames are more accurate and perform better. In addition, because only the fifty anchor frames with the highest scores are displayed, none of them happens to extend beyond the image, so the clipping effect in Fig. 9(b) is not particularly obvious. Next, non-maximum suppression must be performed on the anchor frames to avoid repeated frame recommendations. The idea is to keep the frame with the highest confidence and suppress the remaining frames whose IOU with it exceeds the threshold. The output after non-maximum suppression is shown in Fig. 10.
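The greedy suppression step can be sketched as follows. The default threshold of 0.15 matches the value the paper reports tuning; the code itself is an illustrative reimplementation, not the model's own.

```python
def nms(boxes, scores, iou_threshold=0.15):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and discard remaining boxes whose IoU with it exceeds the threshold.
    Boxes are (y1, x1, y2, x2); returns the kept indices."""
    def iou(a, b):
        y1, x1 = max(a[0], b[0]), max(a[1], b[1])
        y2, x2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, y2 - y1) * max(0, x2 - x1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Two near-duplicate boxes and one distant box: the duplicate is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
assert nms(boxes, scores) == [0, 2]
```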
After non-maximum suppression, the final recommended anchor frames of the regional recommendation network are output. It can be seen that the anchor frames have shifted from repeated focus on the same objects to a broader focus, that is, attention to all possible objects. The output anchor frames now calibrate a wider range of content, avoiding the neglect of information.
Next come the three branches of the head network, that is, the tasks of classification, calibration, and MASK generation, combined with the generated anchor frames and the original image information. First, the image regions within the anchor frames are classified and the frames are predicted. Each predicted frame correction is combined with its anchor frame for a further correction, then a confidence level is generated for each category and filtered according to the confidence threshold to produce the final frames, as shown in Fig. 11.
Here is a step-by-step analysis of the process of border prediction and classification. The border effect of this part of the network input is shown in Fig. 12; it is a recommended area that has not yet been adjusted, and the input recommended borders are quite rough and cluttered. The offset values predicted by the frame prediction network are used for correction; the comparison of several frames before and after correction is shown in Fig. 13.
After the correction, the calibration of the frames is more accurate and performs excellently. Then non-maximum suppression is applied again to the corrected frames to prevent the corrected borders from overlapping and forming repeated recommended areas. The frames after non-maximum suppression are shown in Fig. 14.
At this point, the resulting prediction frames and classifications are the final output of the network, and it can be seen that the model has quite good predictive power. In addition, the paper also computes statistics on the predicted deviation values and obtains the deviations of the four predicted calibration-frame components, as shown in Fig. 15.
It can be seen that the degree of correction for the anchor center is small, while the degree of correction for the width and height of the frame is large. This is very likely related to the initialization of the hyperparameter controlling the aspect ratio of the recommended anchor frames, so changing the anchor aspect ratios could be considered to improve the result.
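How the aspect-ratio hyperparameter shapes the anchors can be sketched as follows, assuming ratios are height/width and that anchor area is held constant per scale; this is a common convention, and the paper does not spell out its exact parameterization, so the details here are illustrative.

```python
import numpy as np

def anchors_at_point(cy, cx, scale, ratios=(0.5, 1, 3)):
    """Generate (y1, x1, y2, x2) anchors centred at one feature-map point.

    `ratios` is height/width; area is held at scale**2, so a ratio of 3
    gives a tall, narrow box suited to line-like targets such as formulas.
    """
    boxes = []
    for r in ratios:
        h = scale * np.sqrt(r)
        w = scale / np.sqrt(r)
        boxes.append((cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2))
    return boxes

boxes = anchors_at_point(0, 0, scale=8)
areas = [(y2 - y1) * (x2 - x1) for y1, x1, y2, x2 in boxes]
assert all(abs(a - 64) < 1e-9 for a in areas)  # area preserved per ratio
h_half, h_tall = boxes[0][2] - boxes[0][0], boxes[2][2] - boxes[2][0]
assert h_tall > h_half  # larger ratio -> taller anchor
```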
The prediction branch of the MASK is a new feature of the MASK RCNN algorithm: it generates a MASK for each predicted instance through a fully convolutional neural network, achieving pixel-level instance segmentation. For this image, the MASK generated by the algorithm model is shown in Fig. 16.
Relative to the calibration frame, the MASK expresses more of the network model's view of the image, such as what the network focuses on and what it ignores. Part of the output of the activation layers of the fully convolutional network is shown in Fig. 17.
It can be seen that the network has a higher response to images such as formulas and a relatively low response to larger images. In addition, it also responds to some of the numbers and formulas in the body text, which forms a certain amount of noise. In general, the model produces an excellent activation response for the targets it is meant to predict.

Model evaluation
Combining the above analysis results, different hyperparameters are used to adjust the model, the F1 score is used to evaluate it, and finally relatively appropriate hyperparameters are found. In MASK RCNN, the hyperparameters that have the greatest influence on the model results in the inference phase are the shape ratio of the anchor frame, the non-maximum suppression threshold and the confidence, so multiple experiments were run under two IOUs (0.6 and 0.8) to explore the optimization performance of the hyperparameters. The results of the experiments with an evaluation IOU of 0.6 are shown in Table 3.
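The F1 score used in these experiments is derived from detection counts as below, where a true positive is a prediction whose IoU with a ground-truth box meets the threshold (e.g. 0.6 or 0.8). The counts in the example are hypothetical, not taken from the paper's tables.

```python
def f1_score(tp, fp, fn):
    """F1 from detection counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical run: 85 correct detections, 10 spurious, 10 missed.
assert abs(f1_score(85, 10, 10) - 0.895) < 1e-3
assert f1_score(0, 5, 5) == 0.0  # no true positives -> F1 is zero
```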
As can be seen from Table 3, the optimal hyperparameters are: the shape ratio of the anchor frame is [0.5, 1, 3], the threshold of non-maximum suppression is 0.15 and the confidence is 0.85; the F1 score obtained at this time is 0.7751. Substituting these hyperparameters into an IOU of 0.8 yields an F1 score of 0.5417. The verification diagrams of Fast R-CNN, Faster R-CNN and MASK RCNN are shown in Fig. 18.
The optimal hyperparameters are: the shape ratio of the anchor frame is [0.5, 1, 3], the threshold for non-maximum suppression is 0.15, and the confidence level is 0.85. The F1 score obtained at this time is 0.8951. When these hyperparameters are substituted into an IOU of 0.8, the F1 score is 0.7417.
According to the results of the Pattern Recognition Laboratory of the Chinese Academy of Sciences [15], this algorithm model ranked first at an IOU of 0.6. At an IOU of 0.8 it ranked second, and the first-ranked model is a non-end-to-end, single-task model. It can be seen that the adjustment of the hyperparameters and the training of the algorithm model are relatively successful.

Results And Discussion
According to the results of the Pattern Recognition Laboratory of the Chinese Academy of Sciences [15], this algorithm model ranked first at an IOU of 0.6. At an IOU of 0.8 it ranked second, and the first-ranked model is a non-end-to-end, single-task model. It can be seen that the adjustment of the hyperparameters and the training of the algorithm model are relatively successful.