Style transfer network for complex multi-stroke text

Neural style transfer has achieved success in many tasks. It has also been introduced to text style transfer, which uses a style image to generate transferred images whose textures and shapes are consistent with the semantic content of the reference image. However, when the text structure is complex, existing methods encounter problems such as stroke adhesion and unclear text edges, which degrade the aesthetics of the generated image and bring a lot of extra workload to designers. This paper proposes an improved text style transfer network for complex multi-stroke text. We use Shape-matching GAN as the baseline and make the following modifications: (1) morphological operations, erosion and dilation, are introduced in image preprocessing; (2) an SN-ResBlock is added to the structure network, and a BCEWithLogitsLoss is added to the texture network; (3) the AdaBelief optimizer is adopted to constrain the transfer of text structure. Furthermore, a new dataset of traditional Chinese characters is constructed to train the model. Experimental results show that the proposed method outperforms state-of-the-art methods on both simple characters and complex multi-stroke characters, increasing the readability and aesthetics of the text.


Introduction
As an important part of artistic works, artistic characters are widely used in posters, advertisements and other works. When designing posters or web pages, graphic designers need to enhance the visual effect according to the different elements of the theme. With the development of deep learning techniques, the automatic generation of artistic characters has drawn much attention in the computer vision community. Yang et al. [1] used a highly regular analysis of the spatial distribution of text effects to guide the synthesis process. The algorithm generates artistic typography that is consistent with the local texture patterns and global spatial distribution of the examples. Subsequently, the authors proposed a new texture effect transfer network (TET-GAN) [2] to achieve the goals of style transfer and style removal, and constructed the text effect dataset TE141K [3]. Different from network models [4] that require paired training sets, Yang et al. proposed Shape-matching GAN [5], which does not require font pairs for training. The network can establish an effective font style mapping at different deformation levels and can generate a set of styled characters from only one style image. However, when a character has many strokes and a complex structure, the generated style transfer image may have some problems, as shown in Fig. 1. Zhang et al. [6] proposed a font effect generation model based on pyramid style features. Although the style transfer results on multi-stroke characters are improved, there are still some artifacts, as shown in Fig. 2.

Fig. 1 A failure case of Shape-matching GAN [5]. When the text structure becomes complicated, the generated image has problems such as holes, unclear font edges and stroke adhesion, which affect the readability and aesthetics of the text

Fig. 2 A failure case of Zhang et al. [6]. There are some artifacts around the glyph, marked with yellow boxes
Through in-depth inspection of existing text style transfer methods, we find that there still exist some shortcomings, especially in the case of complex font structures. To solve this problem, we propose an improved text style transfer network for complex multi-stroke fonts. We adopt Shape-matching GAN as the baseline network and make appropriate optimizations for dealing with complex multi-stroke fonts. Firstly, we use morphological operations to transform the font images. Secondly, we add residual modules to the network model and introduce spectral normalization [7]. Thirdly, we use the AdaBelief optimizer [8] on the whole network structure, and add BCEWithLogitsLoss to enhance generalization. In addition, we build a new dataset of traditional Chinese character images for training. Experimental results show that, compared to the baseline models, our proposed method achieves better style transfer on Chinese characters with both simple and complex structures.
Our main contributions are as follows:
• We propose a style transfer network for complex multi-stroke text that can use any style image to generate corresponding text style effects without paired training sets.
• We construct a new traditional Chinese character image dataset for training, which consists of 1416 character images in the Hanyi Dahei traditional Chinese font.
• We improve the original algorithm. The SN-ResBlock, which is used to extract deep features, is incorporated into the network, and a new optimizer, AdaBelief, is adopted to optimize the whole network, with BCEWithLogitsLoss added to the loss function to enhance the generalization ability.
• Experimental results show that the proposed method generates better style transfer images than existing methods in terms of font texture and structure.

Image-to-image translation
Image-to-image translation belongs to the image processing methods in the field of computer vision. It mainly uses different styles to render the semantic content of an image and realizes a mapping from the input image to the output image. The emergence of the generative adversarial network (GAN) [9] further advanced the development of the style transfer field. As a powerful deep learning model, GAN performs well in image-to-image style transfer tasks, and more and more researchers have conducted in-depth research in this field. Pix2pix [10] generates realistic images from simple sketches by training on paired data, contributing a lot to image reconstruction and image colorization. CycleGAN [11] is more advantageous when much of the training data is unpaired: it can transform images from a source domain X to a target domain Y without training pairs, and produces good results in photo enhancement, style transfer and so on. Based on CycleGAN, U-GAT-IT [12] presents a new unsupervised image-to-image translation method that takes an end-to-end approach. It mainly combines an attention module with a learnable normalization function called AdaLIN, and can flexibly control shape and texture changes by learning control parameters. LoGANv2 [13] reveals that high-quality conditions can embed finer details into the latent space, so it can generate more diversified outputs. Zhang et al. [14] reduce the ghosting artifacts in the background of images with low texture and homogeneous areas.

Neural style transfer
In 2015, Gatys et al. [15] applied the VGG19 network [16] to style transfer for the first time and proposed a neural style transfer method. In particular, texture can be represented by a statistical model of the local features of the image: by constructing the Gram matrix, the style feature representation of any image can be extracted to generate high-quality images of different styles. However, in a convolutional neural network, the different convolutional layers are only linked at the level of individual pixels in the image, so important semantic information will be lost. Therefore, Li et al. [17] proposed to combine the convolutional neural network with a Markov random field, which replaces the Gram matrix in the model. Using the Markov regularization model, the image feature maps of the convolutional neural network are segmented into multi-region blocks for matching, thereby improving the visual effect of the composite image. However, the limitation of this method is that the input style image should have a shape similar to the content image, so the style image cannot be selected freely. After several years of development, neural style transfer has been applied to many fields, such as fashion [18], photography [19], portrait [20], music [21], etc.

Text font style transfer
The style transfer of text fonts refers to generating text with the same style as a target style by learning from a specified style image. MC-GAN [22] generates other letters in the same style from a few existing stylized letters; although the style transfer results are good, it is limited to the 26 English letters. FTransGAN [23] showed that font styles can be transferred between different languages by observing only a few samples; the authors built a font dataset containing 847 fonts, each of which contains Chinese and English characters in the same style. ZiGAN [24] proposed to use some paired samples from different character styles to enable fine-grained associations between different font structures, so it only needs a few target Chinese characters to generate the expected styled characters. Yuan et al. [25] proposed an artistic font generation method; however, for complex Chinese characters with many strokes, this method may produce undesired results, such as stroke sticking. Yuan et al. [26] proposed a multi-style transfer network for text style transfer, which can generate multiple styles of text images in one model and control text styles in an easy way. Gantugs [27] proposed a new neural style transfer method that considers both the neural style difference and the content difference loss, and can generate new fonts by adding or deleting font styles. Wang et al. [28] proposed a novel framework for stylizing texts with exquisite decorations, whose core idea is to separate, transfer and reorganize decorative and basic text effects; they also built a dataset consisting of 59,000 professionally styled texts.

Method
In this section, we first briefly introduce the baseline Shape-matching GAN [5]. Then we present in detail our optimizations and improvements to this network model, which improve the style transfer performance on complex multi-stroke characters.

Shape-matching GAN
Shape-matching GAN is a bidirectional shape matching framework. It can build effective glyph style mappings at different deformation levels, and a set of styled characters can be generated with just one style image, i.e., no paired font datasets are required. The network structure is mainly divided into two stages, as shown in Fig. 3.

Fig. 3 The network module of Shape-matching GAN
In the first stage, the sketch module is used to render artistic text under different deformation degrees controlled by a parameter ℓ ∈ [0, 1], where a larger ℓ corresponds to a greater deformation. The second stage is divided into a structure module and a texture module, modeled by the generators G_S and G_T, respectively. The purpose of this stage is to transfer the style of the font structure first and then transfer the style of the texture. In this way, the shape of the style image is mainly migrated in the G_S phase; then, on this basis, G_T transfers the texture style of the text, so the final generated image is well refined in its details. The structure network module G_S is trained to deform the glyph X according to the degree parameter ℓ. In the testing stage, it converts the original structural style of the glyph to the target structural style, obtaining the structure transfer result I_X. The texture network module G_T then renders the structure transfer result I_X with texture, yielding the final style transfer image. The overall process can be expressed as

I = G_T(I_X) = G_T(G_S(X, ℓ))    (1)
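To make the two-stage pipeline concrete, the following is a minimal inference sketch, assuming G_S and G_T are trained PyTorch modules and that G_S takes the glyph image together with the deformation degree ℓ (the exact call signatures are assumptions, not the authors' published interface):

```python
import torch

@torch.no_grad()
def stylize(glyph: torch.Tensor, ell: float, G_S, G_T) -> torch.Tensor:
    """Two-stage style transfer: structure first, then texture."""
    I_X = G_S(glyph, ell)  # Stage II-a: deform the glyph toward the style shape
    return G_T(I_X)        # Stage II-b: render the deformed glyph with texture
```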

Proposed method
As demonstrated in Sect. 1, current text style transfer methods cannot obtain satisfactory results, especially when the font structure is complex with many strokes. The generated images have many problems, such as stroke adhesion and unclear text edges, which can even make the character difficult to recognize. To solve these problems, we propose improvements to the data processing, the structure module and the loss functions.

Data processing
Multi-stroke characters have more complex font structures, and the style transfer results on these characters often suffer from sticking strokes. To solve this problem, we introduce morphological operations, erosion and dilation, in the preprocessing stage before training. The main function of both operations is to eliminate noise in the image and separate independent image elements. In the image processing stage, an erosion operation is first performed on the depth map of the text image with a convolution kernel of size 2, and then a dilation operation is performed with a convolution kernel of size 1. These morphological operations transform the contour shapes of the font and style, which improves the appearance of complex multi-stroke fonts. The comparison results are shown in Fig. 11.
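As an illustration, a minimal preprocessing sketch with OpenCV is given below; the square structuring elements and single iterations are assumptions inferred from the kernel sizes stated above:

```python
import cv2
import numpy as np

def preprocess_depth_map(depth_map: np.ndarray) -> np.ndarray:
    """Erode then dilate a text depth map to separate touching strokes."""
    erode_kernel = np.ones((2, 2), np.uint8)   # "kernel of size 2"
    dilate_kernel = np.ones((1, 1), np.uint8)  # "kernel of size 1"
    eroded = cv2.erode(depth_map, erode_kernel, iterations=1)
    return cv2.dilate(eroded, dilate_kernel, iterations=1)

# Example usage on a grayscale depth map (hypothetical file name):
# depth = cv2.imread("glyph_depth.png", cv2.IMREAD_GRAYSCALE)
# clean = preprocess_depth_map(depth)
```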

Network structure
The network structure of our model is shown in Fig. 4. Shape-matching GAN is adopted as the baseline. We take any style image pair (the style image and its depth map) and a character image dataset as input to guide the generation of various styles. The network structure is mainly divided into two stages. In Stage I, the sketch module deforms the style image to different degrees through the parameter ℓ ∈ [0, 1]. In Stage II, there are two main parts, the structure module (G_S, D_S) and the texture module (G_T, D_T).
We add an SN-ResBlock, used to extract deep features, to the structure module G_S; it consists of eight residual modules with spectral normalization (SN) [29]. In [30, 31], the authors show that when SN is applied to generators, it can inhibit abrupt changes of the parameters and gradient values. Moreover, SN can stabilize training by scaling down a weight matrix W by its largest singular value, effectively restricting the Lipschitz constant of the network to one. Therefore, we apply SN to the residual modules. The improved modules reduce the stroke adhesion and unclear text edge problems and improve the stability of the network model. The formulation of SN is shown in Eq. (2): the weight matrix W is divided by its spectral norm σ(W), i.e., its largest singular value,

W_SN(W) = W / σ(W)    (2)

We use the AdaBelief optimizer [8] as the new optimizer for the whole network structure. Compared with the original Adam optimizer, it has several advantages: (1) fast convergence; (2) good generalization; and (3) stable training. Therefore, it is more suitable for our improved network model.
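The following is a minimal PyTorch sketch of such a spectrally normalized residual block, with an optimizer note; the 3 × 3 convolutions, channel width and hyperparameters are assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNResBlock(nn.Module):
    """Residual block whose convolutions are spectrally normalized."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            spectral_norm(nn.Conv2d(channels, channels, kernel_size=3, padding=1)),
            nn.ReLU(inplace=True),
            spectral_norm(nn.Conv2d(channels, channels, kernel_size=3, padding=1)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # identity shortcut keeps gradients well-behaved

# Eight such blocks could be stacked inside G_S:
# deep_features = nn.Sequential(*[SNResBlock(256) for _ in range(8)])

# For the optimizer, the official package could be used (a sketch):
# from adabelief_pytorch import AdaBelief
# optimizer = AdaBelief(G_S.parameters(), lr=2e-4, betas=(0.9, 0.999))
```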

Loss function
In Stage II, there are three types of loss functions: the adversarial loss (L_adv), the reconstruction loss (L_rec) and a custom loss (L_gan^T), as shown in Fig. 4.
(1) Adversarial loss. It is the most important loss function in a GAN. The generator and the discriminator constantly confront each other by optimizing their own losses, so that the model can generate more effective and high-quality images. We use x to represent the sketch image, which is the binarization of the style image y. x_ℓ represents the deformation result obtained from the G_B network for different degrees of structural style, where ℓ ∈ [0, 1] is the parameter that controls the degree of deformation. The adversarial loss functions on shape and texture are shown in Eq. (4) and Eq. (5).
(2) Reconstruction loss. The reconstruction loss is essentially the L1 loss, also known as Least Absolute Deviations (LAD) or Least Absolute Errors (LAE). It minimizes the sum of the absolute differences between the target and estimated values; robustness is its greatest advantage. The reconstruction loss functions on shape and texture are shown in Eq. (6) and Eq. (7).
(3) Binary cross-entropy with logits loss. This loss function is also called BCEWithLogitsLoss. We propose to add it to the texture module as a custom loss function, denoted as L_gan^T. It combines a sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain sigmoid followed by a BCELoss, because by combining the operations into one layer it takes advantage of the log-sum-exp trick. Adding it stabilizes the training, makes the loss converge and enhances the generalization ability. The BCEWithLogitsLoss formula is shown in Eq. (8):

L_gan^T = −(1/N) Σ_{n=1}^{N} [ y_n · log σ(x_n) + (1 − y_n) · log(1 − σ(x_n)) ]    (8)
where N is the batch size in the experiment and σ(x_n) is the sigmoid function that maps x_n to the interval (0, 1). The formula is shown in Eq. (9):

σ(x) = 1 / (1 + e^(−x))    (9)

Why don't we add it to the structure module? According to the experimental results, the generated images became worse after adding this loss to that module, and gradient explosion occurred in the loss function. This phenomenon may be caused by the residual modules added to deepen the structure network: if the gradient of each preceding layer is greater than 1, then as the number of layers increases, the final gradient update grows exponentially and gradient explosion occurs; on the contrary, if the gradient of each preceding layer is less than 1, the gradient update decays exponentially as the number of layers increases, and gradient vanishing occurs.
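A small PyTorch check of the numerical-stability claim (the logit values are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

logits = torch.tensor([12.0, -15.0, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])

# Fused version: sigmoid + BCE in one op, stabilized via the log-sum-exp trick.
fused = nn.BCEWithLogitsLoss()(logits, targets)

# Naive version: explicit sigmoid followed by BCELoss; for large-magnitude
# logits the sigmoid saturates, losing precision in the subsequent log().
naive = nn.BCELoss()(torch.sigmoid(logits), targets)

print(fused.item(), naive.item())
```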
The overall loss functions for G_S and G_T combine the above terms with weights:

L(G_S) = λ_adv^S · L_adv^S + λ_rec^S · L_rec^S
L(G_T) = λ_adv^T · L_adv^T + λ_rec^T · L_rec^T + λ_gan^T · L_gan^T

For all experiments, we set λ_rec^S = λ_rec^T = 100 and λ_adv^S = λ_adv^T = λ_gan^T = 1.
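Expressed as a training-step fragment (a sketch; the dummy scalars stand in for loss tensors that would be computed as in Eqs. (4)–(8)):

```python
import torch

# Dummy scalar losses standing in for Eqs. (4)-(8) (illustrative values only).
L_adv_S, L_rec_S = torch.tensor(0.7), torch.tensor(0.01)
L_adv_T, L_rec_T, L_gan_T = torch.tensor(0.9), torch.tensor(0.02), torch.tensor(0.4)

lam_rec, lam_adv, lam_gan = 100.0, 1.0, 1.0  # the weights stated above

loss_G_S = lam_adv * L_adv_S + lam_rec * L_rec_S
loss_G_T = lam_adv * L_adv_T + lam_rec * L_rec_T + lam_gan * L_gan_T
print(loss_G_S.item(), loss_G_T.item())
```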

Experiments
This section presents the experimental results of the proposed method, including the style transfer of simple structured characters, and complex multi-stroke characters. The results of ablation experiments are also shown.

Dataset
We built a new training dataset of traditional Chinese characters. It includes 1416 character images, which were randomly selected from the Hanyi Dahei traditional Chinese character set. Traditional Chinese characters are a form of written Chinese; the term generally refers to the characters that were replaced by simplified characters in the Chinese character simplification movement. For Chinese characters with the same meaning, traditional characters have more strokes and more complex structures than their simplified counterparts, so they are more suitable for experiments on complex multi-stroke characters. Fig. 5 shows a sample image from the Hanyi Dahei traditional character set. The size of all images is 320 × 320. The style image and text images were then converted into depth maps, which indicate the foreground and the background. Fig. 6 shows an example style image, whose content is maple leaves, together with its depth map. The style image and its depth map were both fed into the network.

Fig. 5 The left image is a sample of the simplified Chinese dataset used in Shape-matching GAN [5]. The right image is a sample of the Hanyi Dahei traditional character set

Experimental results and analysis
To evaluate our proposed network, we perform several experiments and compare the results with those of the Shape-matching GAN method, which is taken as the baseline. First, we show the style transfer results on simple characters and on complex multi-stroke characters. Then, ablation experiments on the loss functions and the dataset are performed to show the effectiveness of our improvements to the model.

Results on simple characters
In this experiment, the text images from the TE141K dataset were used as the test set, rendered in Microsoft Yahei. 100 Chinese characters were randomly selected from the system font library for the test. The results demonstrate that the proposed method also achieves good style transfer results on characters with simple structures.

Results on multi-stroke characters
In this experiment, we used text images from the same dataset as in Sect. 4.2.1. 100 Chinese characters with complex multi-stroke structures were randomly selected from the system font library for the test. The comparison results are shown in Figs. 8 and 9. It can be seen that our results outperform those of the baseline method in terms of details: the edges of the text shape are sharper, the texture is more consistent with the style image, and the vitality of the maple leaves is more obvious.

Fig. 9 The comparison results on multi-stroke characters with another input style image

The character image dataset
This section shows the effect of our traditional Chinese character dataset. We perform experiments in different settings, as shown in Fig. 10. Figure 10a shows an example result of the baseline model trained on the public dataset; we can see severe stroke sticking, which even affects the recognition of the character. Figure 10b, c are better than Fig. 10a, but the yellow boxes show that problems such as stroke adhesion still remain. Figure 10d is produced by training our optimized model on our dataset; there are almost no stroke sticking or adhesion problems. The experimental results show that our optimized model trained on the newly constructed dataset produces the best results.

Fig. 10 The ablation experimental results on different training datasets

The morphological operations in the preprocessing stage
Ablation experiments were also performed on the morphological operations in the image processing stage. As shown in Fig. 11, when the erosion operation is used in data processing, we obtain resulting images with clearer font edges, but the erosion operation also brings a hole problem (Fig. 11c). The full model, after adding dilation (Fig. 11d), fills the missing holes while maintaining clear strokes. Compared with the baseline (Fig. 11b), our complete model also fills the holes in the strokes and solves the stroke sticking problem. The results show that the morphological operations are effective and yield better results in the style transfer of complex multi-stroke fonts.

The loss functions
This section compares the iteration curves of the loss function before and after adding the loss function L_gan^T, as shown in Fig. 12.
Before incorporating L_gan^T, the adversarial loss L_adv^G drops to −78.12 and shows a convergent trend. After adding L_gan^T, L_adv^G rises to −11.56 and remains stable. From the comparison of the generated images, we can see that without L_gan^T (Fig. 13c), the resulting image still has unclear text edges: there are indistinct blocks of pixels around the outline of the font, smearing the edges of the strokes. After adding the loss function L_gan^T, it can be seen in Fig. 13d that the problem of unclear stroke edges is solved and the rendering effect is clearer.

Fig. 13 The ablation results before and after adding the loss function L_gan^T

Conclusions
In this paper, we propose a style transfer method for complex multi-stroke characters, which reduces the stroke adhesion and unclear text edge problems of existing methods. Shape-matching GAN is used as the baseline network, and several modifications are made to adapt it to the characteristics of complex multi-stroke fonts. We also build a new dataset of traditional Chinese characters to train the model. Experimental results show that the proposed model achieves better results in both font structure and texture compared to the baseline method. In the future, we plan to further improve the network structure and add an attention mechanism on font textures to reduce the training cost.