Style matching CAPTCHA: match neural transferred styles to thwart intelligent attacks

Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is widely used to prevent malicious automated attacks on various online services. Text- and image-CAPTCHAs have gained broad acceptance owing to their usability and security. However, recent progress in deep learning implies that text-CAPTCHAs can easily be exposed to various fraudulent attacks. Thus, image-CAPTCHAs are receiving research attention to enhance usability and security. In this work, neural style transfer (NST) is adapted to design an image-CAPTCHA algorithm that enhances security while maintaining human performance. In NST-rendered image-CAPTCHAs, existing methods ask a user to identify or localize the salient object (i.e., the content), a task that off-the-shelf intelligent tools solve effortlessly. In contrast, we propose a Style Matching CAPTCHA (SMC) that asks a user to select the style image that was applied in the NST method. A user can solve a random SMC challenge by using the semantic correlation between the content and the stylized output as a cue. The performance in solving SMC is evaluated on 1368 responses collected from 152 participants through a web application. The average solving accuracy over three sessions is 95.61%, and the average response time for each challenge per user is 6.52 s. Likewise, a smartphone application (SMC-App) is devised using the proposed method. The average solving accuracy through SMC-App is 96.33%, and the average solving time is 5.13 s. To evaluate the vulnerability of SMC, deep learning-based attack schemes using Convolutional Neural Networks (CNNs), such as ResNet-50 and Inception-v3, are simulated. The average accuracy of attacks on SMC using ResNet-50 and Inception-v3 across the various studies is 37%, which indicates improved resistance over existing methods. Moreover, in-depth security analysis, experimental insights, and comparative studies imply the suitability of the proposed SMC.


Introduction
CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) relies on solving a hard Artificial Intelligence (AI) problem [68]. The hardness of a CAPTCHA reflects the difficulty of developing an automated algorithm that solves the query with a high success rate, i.e., a computer program that solves a challenge correctly with high probability. In this human-machine interaction, a user is validated through a random challenge before accessing the system. However, an intelligent automated program, also known as a bot, can crack or bypass the hardness of a CAPTCHA and access the system deliberately. Over the decades, several variations of CAPTCHAs (e.g., textual, image, audio, cognitive, adversarial, and visual reasoning) have been developed to thwart different categories of bot attacks [1,6,14,29,37,58,64,65,70,72,73,78]. State-of-the-art methods have emphasized the security and robustness of the underlying algorithms. However, the strengths of CAPTCHAs (e.g., text-CAPTCHAs) can easily be undermined by deep neural networks [38,62,71,75,85]. To overcome the limitations of text-CAPTCHAs, image-based CAPTCHAs are considered a suitable alternative for further security enhancement [64]. Nevertheless, only a few methods have applied real-time verification or biometric authentication [4] for liveness detection [6,67] to improve security.
In a parallel line of work, style transfer has been widely explored by computer vision researchers [33]. Convolutional neural networks (CNNs) have shown great success in artistic Neural Style Transfer (NST) between various classes of content and style images [21]. NST has proliferated into diverse and interesting applications such as image super-resolution [17], geometric warping [43], video [53], CAPTCHA [12], image steganography [45], and restorable arbitrary NST [44]. Recently, a frame-based arbitrary video style transfer method was proposed that aligns cross-domain features with input videos by leveraging multi-channel correlation [35].
In general, a content image ($I_c$) and a style image ($I_s$) are both fed into a deep network ($N$) to produce a stylized image ($I_t$) as the outcome of NST, as illustrated in the top row of Fig. 1. One research direction aims to enhance controllability in stylization tasks [10], stability [27], robustness [74], computational speed [10,34], etc. Another direction investigates the suitability of applying NST in other broader areas [33,61] (Fig. 2).
The discrimination between humans and bots based on CAPTCHAs leveraging NST has been explored in recent times [9,12]. Essentially, an NST-based technique uses $I_t$ to generate a CAPTCHA challenge $Q$ using an algorithm $H$. If the given query $Q$ is solved correctly, the respondent is accepted as a human; otherwise, the respondent is rejected. The objective remains the same, namely identifying the object(s) of interest in the query image. Human vision can recognize and solve the challenge despite adversarially perturbed stylized content, but the scheme is vulnerable to attacks. Modern optical character recognition (OCR) tools (for text-CAPTCHAs), AI-based object detectors (for object detection in image-CAPTCHAs), and other off-the-shelf sophisticated vision-based tools can easily undermine the underlying strengths of an NST algorithm, as shown in Fig. 3a. Even adding more visual difficulty while balancing usability, or introducing complex patterns or illumination variations, cannot hinder the localization of the content or main object using gradient-weighted class activation maps (Grad-CAM) [56], as shown in Fig. 3b. In addition, object detection in NST using Faster RCNN [52] (Fig. 3d) shows a higher success rate. To overcome this issue, we propose that, instead of recognizing or locating salient object(s) in the stylized image, correctly matching the corresponding style image might be an alternative solution for NST-based CAPTCHA design.

Fig. 1 Top: a styled output image of a butterfly leveraging a standard NST method is shown. Bottom: a schematic user interface posing an SMC challenge. A user is required to match (by clicking on the correct style/pattern) the input style image (style image grid in the middle) for each of the three rows by understanding the semantic correlation between the input content (Kingfisher) on the left side and the stylized output on the right side.
We have designed a Style Matching CAPTCHA (SMC) built on NST to enhance security. In existing methods, a user is asked to select the style-transferred regions according to a given description [9,12,64]. In contrast, our proposed SMC relies on human vision to identify the style $I_s$ that is used, along with the content $I_c$, in the generic stylization process to render a stylized output $I_t$. A user is requested to select the style $I_s$ by viewing the content $I_c$ and its textural rendering $I_t$ and perceiving the semantic correlation between them. The SMC thus poses a random CAPTCHA challenge $Q_{SMC}$ as a style-matching task. We demonstrate that changing the query to style matching, i.e., matching the input style/pattern rather than the content/object in focus (see Fig. 1), makes the problem difficult for intelligent agents. Interestingly, it is easily solvable by human users, whereas it remains a challenging task for automated tools and simulated attacks.
To guard against bot attacks, source images are selected at random with a broad range of variations (i.e., data augmentation). Dynamic data augmentation offers additional randomness for designing adversarially perturbed content in a stylized representation. In addition, user responses are collected in the same order in which the stylized images appear, following serial repetition. A correct match maintains a higher similarity score and correlation between an ordered pair ($I_c$, $I_t$) for neural-style matching in SMC. The proposed method is conceptualized in Fig. 4.
According to human perception, this challenge can be solved effortlessly by humans but is difficult for bots. Modern computer programs based on deep learning can recover the content from stylized images (Fig. 3a, b). However, to the best of our knowledge, restoring the exact style or pattern from a stylized image is tough for current deep architectures with minimal training samples within a stipulated time. We emphasize the limitations and benefits of the current CAPTCHA design using deep networks. Grad-CAM [56] can focus on the central part of a pattern where a significant degree of variation is involved (Fig. 3c). Thus, it may not be able to match the correct style from a partial pattern. To explore further in this direction, an attack scheme based on deep learning is simulated using ResNet-50 [28] and Inception-v3 [63] as standard backbone CNNs to verify the security strengths of SMC. The low accuracy of this attack scheme implies the effectiveness of our proposal. The main contributions of this paper are:
- A novel Style Matching CAPTCHA is devised leveraging neural style transfer. The proposed method matches styles or patterns to thwart available object detectors, vision-based AI tools, and related bot attacks.
- Human performance in solving the SMC challenges through web and smartphone applications implies improved usability and achieves state-of-the-art performance.
- Deep learning-based attack schemes built on standard CNNs fail to achieve satisfactory performance in breaking an SMC challenge.
- Several image pre-processing methods (e.g., denoising) are applied in attempts to break the strengths of SMC. A comparative study with text-CAPTCHAs is presented. In-depth security analysis implies the benefits of our SMC.
- Comprehensive evaluation implies that our scheme is robust, reduces the probability of attack, and widens the applicability of the proposed SMC.
The remainder of this paper is organized as follows: Sect. 2 presents a study on CAPTCHAs and NST. Section 3 describes the proposed method, Sect. 4 analyses security aspects, Sect. 5 discusses experimental results. Section 6 states limitations and future work, followed by a conclusion in Sect. 7.

Related works
Several CAPTCHA techniques have been developed since the concept's inception in 2003 [37,68]. Existing schemes can be broadly categorized into the following classes: (a) text, (b) image, (c) audio, (d) video, (e) cognitive, and (f) miscellaneous. Among these, text-CAPTCHAs are widely studied [46]. Recently, a text-CAPTCHA technique based on the Hindi language that concurrently uses printed and handwritten Hindi characters was presented [38,40]. In that work, k-nearest neighbors, support vector machines, and random forest classifiers are used to break ten distinct colored CAPTCHAs. In addition, the recognition of hollow Hindi characters in text-CAPTCHAs is described, with good performance in recognizing distorted and multi-scaled hollow characters [36]. However, modern optical character recognition (OCR) technology can breach text-CAPTCHAs and compromise security [64]. Remarkable success has been attained in breaking monochrome Devanagari CAPTCHA schemes using several classifiers, such as Random Forest [39]. An in-depth analysis of breaking colored Hindi CAPTCHAs is given in [38]. Recently, text-based NST for complex multi-stroke texts was explored [8]. Audio CAPTCHAs are also receiving research attention. A generative adversarial network (GAN)-based method has been developed for audio CAPTCHAs to improve security by generating adversarial perturbations [70]. Visual reasoning based on commonsense knowledge is exploited in CsCAPTCHA to improve security and usability [72]. Several new CAPTCHA design strategies have been developed using deep learning in recent times.
In contrast, image-CAPTCHAs distinguish humans from malicious bots using various vision-based schemes, such as object detection, target recognition, and scene understanding. These schemes are deployable on various touch devices and smartphones with better convenience and usability [83]. Thus, we have studied image-CAPTCHAs, and a concise summary is presented in Table 2.

Generic image-based CAPTCHAs
The naming CAPTCHA, distinguishing CAPTCHA, and anomaly CAPTCHA [80] are the initial image-CAPTCHAs. Users are required to find the similarities or dissimilarities in a set of images. However, these schemes suffer from misspelling and mislabelling by users, as well as synonymous and polysemous words. To solve Implicit CAPTCHA [3], users need to tap on the ideal position or a particular word on an image. Collage CAPTCHA [59] combines various objects into a single image, and users are asked to choose specific objects. It necessitates a database of tagged object images, which are dispersed and rotated arbitrarily over the background. Collage CAPTCHA is often attacked using object segmentation and recognition-based methods. ARTIFACIAL [54] introduces a complex 2D facial model that exploits the human ability to recognize faces. However, ARTIFACIAL can also be cracked easily, as described in [84]. ASSIRA [18] uses a grid of images of cats and dogs, and users must recognize the cats. ASSIRA uses multiple-choice questions to expand the solution space and thus improve security. Golle et al. [23] undermine ASSIRA by using image recognition that integrates color and textural features to distinguish cats from dogs. IMAGINATION [16] exploits people's imagination by asking them to interpret images against a distorted and cluttered background. It comprises two successive steps: click and annotation. Still, it is vulnerable to attack, as specified in [84]. A significant breakthrough is Google's reCAPTCHA [69]. It is designed around a virtual checkbox and, interestingly, does not require any text, image, audio, or video data to pass the challenge. The latest version, Google's No CAPTCHA reCAPTCHA [57], increases usability. However, deep-learning models can solve Google's No CAPTCHA reCAPTCHA [62]. Polakis et al. [49] present a scheme using an image selection and modification mechanism based on Facebook's social authentication to boost security. Human face photos that are indistinct, obstructed, or taken from behind are used in this technique to protect against malicious bot efforts. Users can recognize their friends from these avatars based on prior information. Several CAPTCHAs based on human faces, such as FR-CAPTCHA [25] and FaceDCAPTCHA [24], have also been tested. In both schemes, the face images are blended over a complex background and then subjected to various geometric transformations (e.g., rotation) and noise. FP-CAPTCHA [51] challenges users to click on human facial points such as the eyes, nose, and mouth, which are laid over a cluttered background with additional noise. Hand-CAPTCHA is implemented using a randomized combination of two real and five to seven fake hand images [5]. In addition to hand biometric verification, liveness detection is added to the verification pipeline to improve security [6]. Vessel CAPTCHA [15] targets 3D brain vessel segmentation. It divides the image into 2D patches, and users identify the patches containing a vessel or part of one.

Fig. 3 Panels: (a) object detection in content/styled image using an AI tool; (b) class activation maps on generic NST; (c) class activation maps on style/pattern only; (d) object detection with bounding-box regression using Faster RCNN. a Success of available AI-based vision tools for object detection (surrounded with green bounding boxes) from images, which leads to a major limitation of current NST-based CAPTCHAs by localizing the salient object(s) in the challenge. b The addition of random adversarial noise/complex distortion and transformation/data augmentation is futile in preventing identification of the key content/object using gradient-weighted class activation maps (Grad-CAM). c The Grad-CAMs are centralized in the style/pattern images. d The content-object detection success is higher using Faster RCNN [52]. Best viewed in color.
Most of these techniques are based on object detection or localization which are vulnerable to attacks. Jia et al. [31] proposed a novel image-text-based model for creating CAPTCHA that is based on cognitive processes and semantic reasoning. In order to create a multi-conditional CAPTCHA that can resist the attack of CNN's classification, this technique combines three features: sentence, object, and location.

Mobile device-based CAPTCHAs
There has been a lot of research done on mobile-based CAPTCHA schemes [7,66,77], and a few image-based examples are included here.
Noise CAPTCHA uses two noisy images of different sizes and a concealed object or message at a precise location in the image [47]. To pass the test, participants are required to drag the noisy image over the large image until the hidden item becomes visible, and then press a submit button. An orientation sensor-based CAPTCHA scheme named Sen-CAPTCHA is described in [19]. It displays an image of an animal on the screen and asks users to tilt their phones to direct a colored ball toward the center of the animal's eye. TapCAPTCHA [2] is based on audio and gesture interaction with smartphones, and its use by visually challenged people has been assessed. In addition, TapCAPTCHA is compared with audio CAPTCHA in terms of efficiency, accuracy, user satisfaction, and workload. In Augmented Reality CAPTCHA [32], a user is prompted with a specific marker in a 3D physical environment. As the mobile device rotates, the CAPTCHA's position changes. Once the CAPTCHA shape is detected, the user needs to rotate the mobile device to change the angle. Annuli-CAPTCHA [81] uses overlapping annuli composed of circles and ovals. Users are asked to enter the correct number of circles and ovals in the query as the solution.

Neural style transfer (NST)-based CAPTCHAs
The deep learning-based DeepCAPTCHA [48] applies adversarial noise and strengthens security by defending against common image processing attacks. Recently, NST has been adapted in various image-CAPTCHAs. SACAPTCHA [64] generates a synthetic image containing different shapes rendered in various styles. Users are instructed to click on foreground style-transferred regions based on a brief description to solve the challenge. However, [50] describes an attack model that uses Mask RCNN to detect the various shapes that were originally introduced to improve the resilience of SACAPTCHA. The experimental result (a maximum F1 score of 96%) is sufficient to recognize an object/shape provided in the challenge, which implies that SACAPTCHA is also vulnerable to the Mask RCNN-based attack. In another direction, a Generative Adversarial Network (GAN)-based end-to-end text-CAPTCHA cracking technique is proposed in [42]. It uses cycle-GAN-based synthesizers to create a large number of synthetic CAPTCHA examples for training, in addition to active transfer learning. It achieves a maximum success rate of 97.6% in breaking real-world CAPTCHAs from various websites.
In Grid-CAPTCHA [12], users should choose one out of nine stylized images according to a brief scene description. In addition, the scheme employs the same style to convert all of the images to stylized versions. To baffle recent CNNs, StyleCAPTCHA [9] has been proposed. It uses NST to blend human face images with reference styles to produce stylized face images. The challenge involves classifying ten stylized images as either human faces or animal faces. Inspired by these works, we present a new image-CAPTCHA algorithm that strengthens security while maintaining human performance.

Proposed methodology
The Style Matching CAPTCHA (SMC) generates a random challenge $Q$ based on NST, which requires three main components: a content image ($I_c$), a style image ($I_s$), and a stylized image ($I_t$), as defined earlier. Table 1 lists all the symbols used in this article, sorted by name. In addition, the symbols are described wherever they are used.

Style matching CAPTCHA (SMC): overview
A vital target of a CAPTCHA algorithm $H$ is to maximize the recognition gap between the success rate of a fraction of the human population and that of an intelligent agent, i.e., the absolute difference between the two success rates should be at least a positive threshold and ideally close to 1. To maximize this margin, the randomized challenge $Q$ should be solved repeatedly, a fixed number of times in series, within a stipulated time interval for each answer. According to [68], these parameters, collectively denoted $U$, are essential to define $H$ as a hard AI problem. The verification of a user, who solves $Q$ the required number of times (within the time limit for each answer) in a sequential, repetitive manner, is expressed as the product $\prod_i Q_i$ of the individual outcomes $Q_i$. As the answers are Boolean (yes/no), a user is permitted if this product equals 1. Considering all the prerequisite parameters, we define a generic challenge $Q = H(I_t, U)$. As $U$ represents the cognitive constraints, from the algorithmic design perspective this can be simplified as $Q \approx H(I_t)$, where $I_c$, $I_s$, and $I_t$ represent the content image, style image, and stylized image, respectively, $N$ is the deep network that produces $I_t$, $H$ is the algorithm for creating the challenge, and $U$ is the cognitive constraint.

Now, we define SMC specifically as $Q_{SMC}$, derived from the generic $Q$. Our approach requires choosing $I_s$ from a set of random samples provided in a grid structure (Fig. 4). By observing $I_c$ and the output $I_t$, a user has to select the appropriate style/pattern $I_s$. From the user's perspective, the challenge $Q_{SMC}$ is posed by the algorithm $H$ in terms of the content image $I_c$, the style image $I_s$, and the stylized image $I_t$. From the designer's view (1), the key components additionally include a random function that selects images at random from the database ($D$) and a mapping function $M$ that places the style or stylized images in the image grid.
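The acceptance rule described above can be illustrated with a minimal Python sketch. The function names, the per-answer time limit, and the sample response times are hypothetical and not taken from the paper; only the product-of-Boolean-outcomes rule and the notion of a gap between human and automated success rates come from the text.

```python
def recognition_gap(human_rate, bot_rate):
    """Absolute gap between human and automated success rates (both in [0, 1])."""
    return abs(human_rate - bot_rate)


def verify_user(outcomes, response_times, time_limit):
    """Accept a user only if every repetition is answered correctly and in time.

    outcomes: one Boolean per repetition (True means a correct answer).
    response_times: response time in seconds for each repetition.
    time_limit: hypothetical per-answer limit in seconds (an assumption).
    """
    if not outcomes or len(outcomes) != len(response_times):
        return False
    product = 1
    for ok, seconds in zip(outcomes, response_times):
        # Each repetition contributes 1 (correct and in time) or 0 to the product.
        product *= int(bool(ok) and seconds <= time_limit)
    return product == 1
```

For instance, `verify_user([True, True, True], [5.0, 6.5, 4.2], 10.0)` accepts the user, whereas a single wrong or late answer makes the product zero and the user is rejected. With the rates reported in the abstract, `recognition_gap(0.9561, 0.37)` is roughly 0.59.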
A random function selects $I_c$ and $I_s$ (from database $D$), from which a style-transferred image $I_t$ is generated using a deep network $N$ that implements NST. Database $D$ contains a set of content images, denoted as $C = \{I_c\}$, and a set of style images, denoted as $S = \{I_s\}$. Any standard deep network can be defined layer-wise through an activation function $f$ applied to the weighted inputs plus a bias, where $W_i$ is the weight, $X_i$ is the input, and $b$ is the bias in the $i$th layer; here, we denote the network as $N$ for simplicity. In addition, a few more random styles that are not used in NST are selected to fill the remaining empty places in the style-grid ($SG$); these are denoted as fake styles $\tilde{I}_s$. A challenge $Q_{SMC}$ is generated from all of these images by positioning the actual ($I_s$) and fake ($\tilde{I}_s$) styles arbitrarily in the grid $SG$. A function $M$ maintains the correct pair-wise mapping between ($I_s$, $I_t$) with respect to the reference content $I_c$. The user then needs to identify the correct one-to-one correspondence between ($I_s$, $I_t$). For a given $I_c$, a series of valid matches of each $I_t$ with its respective $I_s$ is considered a correct solution of the query. Each $I_t^i$ is produced by feeding the ordered pair ($I_c$, $I_s^i$) to $N$, where $i$ indexes the repetitions, i.e., the number of times a user is verified by solving $Q_{SMC}$.

Fig. 4 The user is asked to select (by clicking on the SMC web page or touching the SMC-App screen) the input style images for correct matching based on the rendered stylized and content images as the cue, according to a specific order, shown in the middle. The correct and wrong matching scheme is illustrated on the right side.
Moreover, deep networks use non-linear activations between layers, random hyperparameters, and random weight distributions in the learning process while optimizing the loss function. Thus, it is hard for a system to replicate the same model output by guessing the model parameters. As a result, it becomes harder for an automated program to solve a challenge within a given time limit, which adds security.
SMC Algorithm: Algorithm-1 produces a random challenge $Q_{SMC}$, as shown in Figs. 1 and 4. Initially, the style-grid $SG$ and the stylized-grid $TG$ are two empty image grids. A random function selects a content image ($I_c$) from $C$ (the content images in database $D$). Next, three random styles ($I_s$) are selected from $S$ and placed on the $m \times n$ style-grid $SG$, one in each row, using a function $M$ that maintains the correspondence between each actual/real style and its stylized rendering $I_t$ in $TG$. To ensure randomness in the position of an actual style, a random natural number $p$ is generated as the index at which the real $I_s$ is placed in a row. The indexes of the remaining free places in the same row are stored in a set. The style placement in each row thus varies according to $p$ and this set of free indexes.
Next, a deep NST method is used to produce the stylized images $I_t$, which are placed in the respective positions of $TG$. The remaining empty positions in $SG$ are filled with counterfeit/fake styles ($\tilde{I}_s$) at the stored free indexes using $M$. This process is repeated serially for each row (here, the number of repetitions equals $m = 3$), and it is scalable to larger row/column values. These classes of images with their corresponding grid representations ($I_c$, $SG$, $TG$) are juxtaposed in a single frame as the algorithmic output $H(I_c, SG, TG)$. Finally, a user-friendly interface (i.e., SMC-App or the web application) presents this random challenge $Q_{SMC}$ to the user for solving. In particular, image pre-processing techniques are applied to build the basic structural representation of SMC. First, a noisy background is created, over which a 3 × 3 grid for style images and a 1 × 3 grid for stylized images are generated. Next, 3 stylized images are chosen and placed randomly on the stylized-grid ($TG$). Then, the 3 corresponding style images are selected and placed randomly on the style-grid ($SG$) using the mapping function ($M$). Likewise, $M$ is used to place fake style images in the remaining 6 positions of $SG$. In addition, random rotation and scaling are applied to the style images at the processing stage. To remove the black pixels that appear at image-boundary regions due to rotation, an alpha channel is added to make those pixels transparent. Finally, the resized style images are placed over the $SG$ grid.
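The grid construction described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' Algorithm-1: images are represented by string identifiers, the NST output is a placeholder tuple rather than a rendered image, the stylized image for row r is simply stored at index r, and the name `generate_challenge` and its parameters are hypothetical.

```python
import random


def generate_challenge(contents, styles, rows=3, cols=3, rng=None):
    """Build the layout of one SMC-like challenge.

    contents, styles: lists of image identifiers (stand-ins for database D).
    Returns the chosen content, the rows x cols style grid, one stylized
    placeholder per row, and the column index of the real style in each row.
    """
    rng = rng or random.Random()
    if len(styles) < rows * cols:
        raise ValueError("need at least rows * cols distinct styles")
    content = rng.choice(contents)
    picked = rng.sample(styles, rows * cols)  # distinct real and fake styles
    real, fake = picked[:rows], picked[rows:]
    style_grid, stylized_grid, answer = [], [], []
    for r in range(rows):
        p = rng.randrange(cols)  # random column for the real style
        free = [c for c in range(cols) if c != p]  # remaining free places
        row = [None] * cols
        row[p] = real[r]
        for c in free:
            row[c] = fake.pop()  # fill free places with fake styles
        style_grid.append(row)
        # Placeholder for the NST rendering of (content, real style).
        stylized_grid.append(("stylized", content, real[r]))
        answer.append(p)
    return content, style_grid, stylized_grid, answer
```

Passing a seeded `random.Random` as `rng` makes the layout reproducible for testing; a deployed system would presumably rely on an unpredictable source of randomness, and the rotation, scaling, and noisy background mentioned in the text are omitted here.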
The Gram matrix (6) in NST (Sect. 3.2) captures the feature-map distribution within a layer. The style loss $L_{style}$ (8) improves the matching between the feature-map distributions of $I_s$ and $I_t$ in a layer. $N$ optimizes two different loss functions, $L_{content}$ (5) and $L_{style}$ (8), for the two different inputs $I_c$ and $I_s$, respectively. Therefore, it does not produce the same output $I_t$ if the roles of the input images are interchanged, i.e., $I_{t_1} = I_{c,s}$ and $I_{t_2} = I_{s,c}$. The results differ significantly in the two cases, i.e., $I_{c,s} \neq I_{s,c}$. Our objective is to generate $Q_{SMC}$ as a CAPTCHA challenge based on a blended image derived by NST. To add more insight, we have tested the effect of interchanging the roles of the content and style inputs. However, this input-swapping strategy makes it harder for users to solve a random $Q_{SMC}$.
The outcomes of both possibilities, i.e., $I_{t_1}$ and $I_{t_2}$, are shown in Fig. 5. We prefer $I_{t_1}$ according to the structural similarity index (SSIM), as $I_{t_1}$ offers a balance between usability and security, as is evident from Fig. 5.
During verification, a human participant needs to observe the content and stylized output images carefully to pass the challenge. The user is requested to select the corresponding styles from the style-grid, one per row. Humans can easily detect the patterns, whereas automated programs or bots cannot perform the task even with rigorous attempts. If a user selects all three styles/patterns correctly, $Q_{SMC}$ is considered solved; otherwise, it is not (Algorithm-1).
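As a rough illustration of the pass/fail rule above, the following Python sketch checks a response against the expected column per row and computes the chance that blind guessing passes. The function names are hypothetical, and the guessing baseline assumes independent, uniform choices in each row; it says nothing about the learning-based attacks studied later.

```python
from fractions import Fraction


def is_solved(answer_columns, clicked_columns):
    """True only if the clicked column matches the real style in every row."""
    if len(answer_columns) != len(clicked_columns):
        return False
    return all(a == c for a, c in zip(answer_columns, clicked_columns))


def blind_guess_probability(rows, cols):
    """Probability that uniform random clicks (one per row) solve the challenge."""
    return Fraction(1, cols) ** rows
```

For the 3 × 3 style grid described in the text, `blind_guess_probability(3, 3)` evaluates to 1/27, i.e., about 3.7% per challenge under this simplified assumption.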

Neural style transfer (NST)
We have revisited NST [21,34] to design the proposed SMC. A deep network ($N$) applies a non-linear filter bank at various layers to generate feature maps. $N$ computes feature maps $F^l$ at layer $l$ from $I_c$ with dimensions $H_l \times W_l \times D_l$, where $H_l$ is the height, $W_l$ is the width, and $D_l$ is the number of channels at the $l$th layer. The feature space is represented as $F^l \in \mathbb{R}^{H_l \times W_l \times D_l}$, and $F^l_{i,j}$ is the activation of the $i$th filter at the $j$th location in layer $l$ of $N$. Let $F^l$ be the feature map of $I_c$ and $\hat{F}^l$ be the feature map of the rendered stylized $I_t$ at layer $l$. The content loss $L_{content}$ (5) is the squared Euclidean norm of the difference between these feature maps. The higher layers of $N$ infer high-level content information and the object's appearance in the feature maps. These higher layers are apposite for representing a content summary rather than the style/texture information. To learn the textural pattern in the layers, feature correlation is an effective measure; it is computed using the inner product of the feature maps of the $i$th and $j$th filters at layer $l$. This correlation is represented in the Gram matrix (6), defined as

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk},$$

where $F^l_{ik}$ and $F^l_{jk}$ represent the feature spaces at the $l$th layer. The stylization process at layer $l$ can be optimized using the Gram matrix $G^l_{ij}$ of the source style ($I_s$) and the Gram matrix $\hat{G}^l_{ij}$ of the rendered stylized image ($I_t$): the layer-wise style loss is the MSE loss between these two Gram matrices, normalized using the height $H_l$, the width $W_l$, and the number of channels $D_l$ at the $l$th layer. The total stylization loss $L_{style}$ (8) is the weighted sum of the layer-wise losses over all layers, where $w_l$ is the weight at layer $l$ and $L$ is the number of layers in $N$. Finally, the content and style loss functions are linearly combined and jointly optimized to minimize the error. In addition, to enrich spatial smoothness, the total variation loss $L_{tv}$ (9) [34] is used. It is the sum of the absolute differences between neighboring pixels, denoted as $px_{i,j}$ and $px_{i+1,j+1}$. Combining these three losses, the joint loss function (10) is their weighted sum, in which the three weights are hyperparameters that set the trade-off between the content and style in the rendering process: $L_{content}$ is the total content loss from Eq. (5), $L_{style}$ is the total style loss from Eq. (8), and $L_{tv}$ is the total variation loss from Eq. (9).
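The loss terms described above can be made concrete with a small, dependency-free Python sketch. It is an illustration under several assumptions rather than the paper's implementation: a feature map is a list of D rows (one per filter) with one value per spatial location, the layer-wise style loss is normalized by the number of Gram entries (the paper's exact normalization constant is not reproduced here), the total variation term uses horizontal and vertical neighbors (a common variant), and all function and parameter names are hypothetical.

```python
def content_loss(F, F_hat):
    """Squared Euclidean distance between two D x M feature maps."""
    return sum((a - b) ** 2 for ra, rb in zip(F, F_hat) for a, b in zip(ra, rb))


def gram(F):
    """Gram matrix G[i][j] = sum_k F[i][k] * F[j][k] of a D x M feature map."""
    return [[sum(x * y for x, y in zip(fi, fj)) for fj in F] for fi in F]


def layer_style_loss(F_style, F_hat):
    """Mean squared error between the Gram matrices of style and stylized features."""
    G, G_hat = gram(F_style), gram(F_hat)
    d = len(G)
    total = sum((G[i][j] - G_hat[i][j]) ** 2 for i in range(d) for j in range(d))
    return total / (d * d)  # assumed normalization


def style_loss(style_feats, stylized_feats, layer_weights):
    """Weighted sum of the layer-wise style losses."""
    return sum(w * layer_style_loss(fs, ft)
               for w, fs, ft in zip(layer_weights, style_feats, stylized_feats))


def tv_loss(img):
    """Sum of absolute differences between adjacent pixels of a 2D image."""
    h, w = len(img), len(img[0])
    horiz = sum(abs(img[r][c + 1] - img[r][c]) for r in range(h) for c in range(w - 1))
    vert = sum(abs(img[r + 1][c] - img[r][c]) for r in range(h - 1) for c in range(w))
    return horiz + vert


def joint_loss(l_content, l_style, l_tv, w_content, w_style, w_tv):
    """Weighted linear combination of the three losses."""
    return w_content * l_content + w_style * l_style + w_tv * l_tv
```

In practice these quantities would be computed on tensors from a pre-trained network and minimized by gradient descent; the list-based version only shows how the terms relate to one another.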
Here, we vary their values randomly, which offers additional randomness in the perturbation (Fig. 6) to hinder malicious bots. VGG-19 [55,60], pre-trained on ImageNet, is adapted for implementing SMC. The stylization process, with the loss values at four intermediate iterations between 100 and 2000, is shown in Fig. 8. All the experiments are conducted in TensorFlow 2.x using Python 3.7 scripts, and the Google Colab GPU environment is used for simulating the deep-learning experiments.

Hyperparameters
The total loss is obtained from Eq. (10), whose three weights are the hyperparameters that set the trade-off between the content and style in the rendering process. By varying them, we can control the amount of style present in the stylized output image. Figure 6 illustrates this process with 4 variations of the hyperparameters. Naturally, if we reduce the style component, the attack accuracy degrades, but the task also becomes tougher for users. Therefore, we have maintained a fair balance between the style and content images to produce the stylized output. Nevertheless, we have generated many stylized outputs with a reduced style component and performed a Type-III attack simulation. In this evaluation, only 35% accuracy is achieved, which shows that the chances of a successful attack are low.

Dataset description
The Element-Based Textures Dataset (ElBa) [22] consists of procedurally generated realistic images with variations in shapes, colors, etc. It includes 30k texture images with various levels of local symmetry, stationarity, and density of (3 M) localized texels. Element-based textures are textures made up of texels, i.e., elements that are dispersed according to statistical distributions. The textile, fashion, and interior design industries are the most common users of this dataset. Texel-Att, a fine-grained, attribute-based framework for representing and classifying element-based textures, is frequently utilized because current texture descriptors fail to describe element-based textures correctly. In our experiment, the ElBa dataset is used for style images. In addition, 2000 style and pattern images are collected from other resources such as Kaggle's Abstract Art Gallery. The content images are collected mainly from the Kaggle repository. Our dataset consists of high-quality images of more than 100 object categories such as human faces, animals, birds, and flowers. Around 2000 fine-grained content images are collected and stored in the database. Finally, by combining style and content images, we have created around 3000 stylized images. All the images are resized to 512 × 512 pixels and stored in our database. Samples of these image categories are shown in Fig. 7.

Security analysis
An important attribute of a CAPTCHA is its resilience to various malicious attacks. Here, the security benefits of SMC against various attacks are described.

General deep-learning attack
Image-CAPTCHAs are vulnerable to deep-learning attacks. CNNs can extract and recognize the content information from a stylized image. However, recovering the style or pattern information from a stylized image is hard for modern AI tools or CNNs, because the textured pixels are difficult to identify. To explore this direction further, we have simulated several attack schemes on various datasets, assuming that an attacker has collected some random samples of the stylized and style images used in SMC challenges at various sessions. Next, a systematic cropping technique is applied to these samples to generate more sub-samples for training a CNN for recognition. From each stylized image, 150 sub-samples are generated, and an 80:20 ratio is followed for training (80%) and testing (20%) using ResNet-50 [28] and Inception-v3 [63] as standard CNN backbones. Our objective is to evaluate how accurately a deep model can classify the required styles using a few samples with little manual supervision. This simple classification indicates the underlying strengths of SMC to thwart deep-learning attacks. We have created 4 types of threat models, with a related dataset in each case.
-Type-I: Train and test both with stylized image samples.
-Type-II: Train and test both with style image samples.
-Type-III: Train with stylized and test with style samples.
-Type-IV: Train with style and test with stylized samples.
Here, Type-I and Type-II behave like a general supervised classification task with similar categorical variables. Type-III and Type-IV are more challenging threat models because the training and testing data come from different image domains (style versus stylized). ResNet-50 and Inception-v3 are used as CNN backbones and are trained for 50 epochs with a batch size of 8. The stochastic gradient descent (SGD) optimizer with a learning rate of 0.01 is used for training. The results are given in Table 3. The objective is to verify whether available CNNs can classify the stylized outputs, which is considered the Type-I deep-learning attack on SMC. The classification accuracy on these stylized samples at 256 × 256 resolution is 96.47% using Inception-v3 and 85.01% using ResNet-50 (Table 3). Similarly, the corresponding style images that produce the stylized images are classified using the same CNNs for Type-II attacks. The accuracy is 97.54% using Inception-v3 and 85.68% using ResNet-50. Thus, when training and testing data come from the same domain, the CNNs can classify the stylized and style images with high accuracy and precision. For the entire process of neural-style transfer operations and the various attack analyses, we have used Google Colaboratory.
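The data preparation for the four threat models can be sketched as follows. The source does not specify the cropping scheme beyond 150 sub-samples per image and an 80:20 split, so the evenly spaced 10 × 15 crop grid, the crop size, and all function names below are assumptions made only for illustration.

```python
import numpy as np

def grid_crops(image, crop=128, n_rows=10, n_cols=15):
    """Cut n_rows * n_cols evenly spaced (possibly overlapping) crops from an image."""
    h, w = image.shape[:2]
    tops = np.linspace(0, h - crop, n_rows).astype(int)
    lefts = np.linspace(0, w - crop, n_cols).astype(int)
    return [image[t:t + crop, l:l + crop] for t in tops for l in lefts]

def split_80_20(samples, seed=0):
    """Shuffle the samples and split them 80:20 into train and test lists."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))
    cut = (len(samples) * 4) // 5
    return [samples[i] for i in order[:cut]], [samples[i] for i in order[cut:]]

def threat_model_split(kind, stylized, style, seed=0):
    """Return (train, test) for threat models 'I' to 'IV' as listed in the text."""
    stylized_train, stylized_test = split_80_20(stylized, seed)
    style_train, style_test = split_80_20(style, seed)
    table = {
        "I": (stylized_train, stylized_test),   # train and test on stylized
        "II": (style_train, style_test),        # train and test on style
        "III": (stylized_train, style_test),    # train on stylized, test on style
        "IV": (style_train, stylized_test),     # train on style, test on stylized
    }
    return table[kind]
```

The resulting train and test lists would then be fed to a ResNet-50 or Inception-v3 classifier with the training settings reported in the text.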

Attacks at latent layers
We have examined the style-content blending mechanism at the mid-level convolutional layers of VGG-19 in the NST architecture by exploring the gram-matrix formulation and loss functions. The mid-level layer summarizes a latent representation of a compact feature space. It is widely used in encoder-decoder architectures, GAN-based model representations [30], etc. The mid-level blocks learn latent features that are not easily interpretable as object/style categories from such bottleneck layer(s). Hence, feature learning is very difficult when the model parameters and hyperparameters are highly random initially and are further optimized during training. However, current state-of-the-art CNNs are powerful enough to challenge the underlying strengths of NST. In our experimental study, the highest accuracy of the latent-layer-based attack simulation is about 45%, which is much lower than that reported for existing works such as SACAPTCHA [64] (82%) and CAPTCHaStar [13] (96%). Hence, the proposed SMC improves resistance to a significant extent, roughly halving the attack accuracy reported for these works. A comparative analysis is presented in Table 13.
Our assumption might be relevant for an attack at some latent/intermediate convolutional layer/block of CNNs, shown in Fig. 9. We have assumed that an intruder can access the mid-level latent-style feature representation of the block3 convolutional layers of VGG-19. In particular, 'block3_conv1' for the style image and 'block3_conv2' for the content image are accessible to render a stylized output, whereas the actual style is learned and transferred through all CNN blocks in the main model. Accordingly, we have defined an attack-loss function based on the latent layer as

$$L_{attack} = L^{attack}_{style}(I_s, I_t, l_{attack}) = w_l E_l(I_s, I_t), \tag{11}$$

where $l_{attack}$ denotes the latent layer under attack, $w_l$ are the weights at layer $l$, $E_l(I_s, I_t)$ is the MSE loss between the gram matrices of the style image and the stylized image, and $I_s$ and $I_t$ represent the style image and stylized image, respectively. The total attack-loss is defined in Eq. (12), which involves the hyperparameters of the model simulation at the attacker's side. The difference between the actual style-loss and the attack style-loss is added as a penalty term in the error estimation; such a penalty is generally used as an L1-norm regularizer to generalize learning tasks in CNNs. The original joint-loss function (Eq. 10) is then simplified to Eq. (13) by ignoring the total variation loss (which is used as a variational regularizer for spatial smoothness in [34]), so that it can be related to the style attack-loss function. For a more realistic attack simulation, the parameters of NST at the designer's side and at the attacker's end should be almost identical, with values close to 1, because Eqs. (10) and (13) are simplified and approximated under this condition. Hence, the total attack-loss of Eq. (12) can be simplified to Eq. (14), which is expressed in terms of the attacker-side hyperparameters, the designer-side hyperparameters that control the trade-off between the content and style in the rendering process, and the marginal difference between the two.
The objective of a latent-layer-based attack should incorporate an efficient optimization of all model parameters and regularization such that this marginal difference becomes zero, yielding $L^{attack}_{total} = L_{original\ NST}$. However, the actual NST hyperparameters and the adaptive attack-model hyperparameters are different sets of parameters that are very sensitive and random by nature. Hence, the attack model must be very sophisticated and efficient to achieve the high attack success that is easily obtained against text-CAPTCHAs and other NST-based methods. The mean square error (MSE) measures the error between the actual NST outcome and the adapted latent-layer output, and the results are given in Figs. 10 and 11.
The Visual Geometry Group (VGG)-19 network receives both the style and content inputs, and each image's feature representation is derived separately from a distinct layer. We can obtain distinct stylized outputs from different intermediate layers for a specific content and style. On the left side, the conv-1 layer output is used for content, and the conv-2 layer output is used for style. On the right side, the conv-4 layer output is used for content, and the conv-7 layer output is used for style. After obtaining all these intermediate stylized outputs, we have run the general deep-learning attack simulation of Type-III to test whether style information can be extracted from the stylized images obtained from latent layers. The accuracy is 14%, which is low compared to other works. It implies that style recognition from stylized images produced at intermediate layers is a difficult task.
Faster RCNN is a deep convolutional network that is presented to the user as a single, integrated network for object detection. It is a fast and efficient object detector [52] that can accurately and rapidly predict the positions of various objects (Figs. 14, 15). It detects and recognizes an object within the content image. However, the confidence score degrades if the same content image is blended with a stylized effect via NST. Table 4 shows the result of content/object detection from stylized images using Faster RCNN. The contents are detected but recognized incorrectly. Faster RCNN cannot recognize the objects fish and dog in Table 4 and Fig. 3. However, it recognizes other object classes such as flower, human face, and ship with certain confidence scores. In Fig. 13, Faster RCNN is applied to an image grid, and five out of nine images are detected correctly while the remaining images are not detected. Faster RCNN uses region proposal networks (RPN) to select the regions for pooling. In Faster RCNN, the RPN is trained such that all anchors in a mini-batch of size 256 are extracted from a single image. For a single image, the features are correlated and convergence is easier, whereas it is difficult for a blended or stylized image. As a result, although the correct region can be found, correct classification is not easily possible. In particular, it is observed that the degree of recognition declines when the style images are dark and bold in nature, for example when they contain dark lines.
The Google Cloud Vision AI API makes it easy for developers to incorporate vision recognition capabilities into their applications, such as distinguishing between images, locating faces and landmarks, recognizing text with OCR, and marking explicit content. Google Vision AI also fails to detect the objects in the same image grid, as shown in Fig. 12. This supports our objective of matching styles rather than objects.

Randomness in stylization
To investigate the randomness in the stylization, we have computed the Structural Similarity Index (SSIM) and cosine similarity scores between the input styles and the rendered stylized outputs. For this test, we use ten randomly selected classes of style images and the stylized output images generated from ten different contents with these 10 styles.
In Fig. 16, we have plotted the SSIM and cosine similarities between these two input categories in the NST using heatmaps. It is evident from the heatmaps that intra-style (same style, different content) and inter-style (different style, same content) renderings are too random to reproduce similar outputs. For a particular style (intra) and 10 different contents, the SSIM value varies between 0.218 and 0.524. Likewise, inter-style variations lie within 0.337 to 0.369. Similar variations are also observed using cosine similarities. It is hard to maintain a trade-off between security and usability in CAPTCHA design. No single scheme can be resilient to all possible attacks. Thus, robust security analysis is considered an open problem.
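The pairwise similarity scores behind such heatmaps can be sketched as below. The cosine similarity is computed on flattened pixel arrays. The SSIM shown is a simplified single-window (global) variant with the usual stabilizing constants, used here as an assumption to keep the example self-contained; standard SSIM implementations average the index over local windows, so the values would differ from those reported in the text.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two images treated as flat vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM over the whole image (a simplification of windowed SSIM)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(num / den)

def similarity_matrix(styles, stylized, metric):
    """Matrix of metric(style, stylized) scores, one row per style image."""
    return np.array([[metric(s, t) for t in stylized] for s in styles])
```

A 10 × 10 matrix built with `similarity_matrix` from ten styles and ten stylized outputs could then be rendered as a heatmap.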

Style extraction from SMC
Style extraction from stylized images is a major concern for attack analysis on SMC. We have carried out style transfer as stated in Sect. 3.2. Here, a plain white image is used as the content image, and the stylized image is used as the style image. The output of our experiment is depicted in Fig. 17. We have observed that the original style is not accurately regenerated. However, the colors and textures are reproduced to a certain extent and resemble the original ones. Motivated by this observation, we plan to conduct an in-depth study in the near future, for example on the suitability of an autoencoder (encoder-decoder) architecture.

Resilience of SMC over denoising
A few recent schemes have applied denoising to break text-CAPTCHAs [40]. Accordingly, we have added several types of noise to SMC to offer an extra layer of security, as shown in Fig. 18, and tested whether they can be removed. In particular, we have used the Gaussian filter, the median filter, and DnCNN [81] to remove the noise from the images. Next, we have simulated a deep-learning attack on these denoised CAPTCHAs to evaluate the robustness and security of our proposed system. The noise added to the stylized images includes a few conventional noisy patches, such as circles, arcs, shapes, and lines, as well as periodic and style noise.

Effects of additional noises on SMC
A wide variety of noises with various sizes and colors, including random lines, random shapes, periodic noise, and blended noise, are added over the SMC challenge for further study, as depicted in Fig. 18. These types of noise deliver certain additional strengths, such as resilience against object segmentation attacks and other schemes based on fundamental image pre-processing methods. However, in our case, the original style/pattern image can still be recognized by a human user, even after incorporating complex and random patterns and noise. In Fig. 18b.i, several inclined lines of different colors, as well as thicker and thinner horizontal and vertical lines, are added as noise. Random-shaped objects, including circles, squares, and triangles of different colors and sizes, are superimposed in Fig. 18c.i. Similarly, Fig. 18d.i includes periodic noise that can run in different directions. For users to recognize the style of a stylized image properly, the lines of periodic noise should be narrow. Figure 18e.i depicts blended-style noise on stylized images. Random styles are blended over the target stylized image with a low opacity. This increases the security of the SMC algorithm by making it more difficult to recover the actual stylized image.
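Three of these noise types can be sketched with NumPy as follows. This is an illustrative sketch only: the periods, line widths, opacity, shape sizes, and function names are assumptions, and rectangles stand in for the more varied shapes (circles, triangles, inclined lines) described in the text.

```python
import numpy as np

def add_periodic_noise(image, period=16, width=2, value=0):
    """Draw narrow vertical lines every `period` pixels on a copy of the image."""
    noisy = image.copy()
    for x in range(0, noisy.shape[1], period):
        noisy[:, x:x + width] = value
    return noisy

def blend_style_noise(image, noise_style, opacity=0.2):
    """Blend another style image over the stylized image with a low opacity."""
    mixed = (1.0 - opacity) * image.astype(np.float64)
    mixed += opacity * noise_style.astype(np.float64)
    return np.clip(mixed, 0, 255).astype(np.uint8)

def add_random_rectangles(image, count=5, max_size=40, seed=0):
    """Superimpose randomly placed, randomly colored rectangles on a copy of the image."""
    rng = np.random.default_rng(seed)
    noisy = image.copy()
    h, w = noisy.shape[:2]
    for _ in range(count):
        size = int(rng.integers(5, max_size))
        top = int(rng.integers(0, h - size))
        left = int(rng.integers(0, w - size))
        noisy[top:top + size, left:left + size] = rng.integers(0, 256, size=3)
    return noisy
```

Keeping `width` small and `opacity` low reflects the usability constraint stated above: the noise should hinder automated recovery while leaving the style recognizable to a human.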

Denoise techniques
We have applied several denoising methods for simulating deep-learning attacks on noisy stylized images. In our case, cleaning the image is not an easy task due to the large variation in the noise applied to the images. We have used the following denoising methods, implemented in MATLAB (Version R2020a).
-Gaussian filter (a low-pass filter) is used to blur certain areas of an image and reduce noise (high-frequency components). To attain the desired result, the filter is implemented as an odd-sized symmetric kernel that is passed over each pixel of the region of interest. The effect of applying the Gaussian filter to the noisy images is depicted in Fig. 18b-e. A Gaussian filter can attenuate fine, high-frequency noise in images. However, it cannot easily remove the noises that have been used in the proposed SMC algorithm. This type of filter is ineffective for denoising our noisy stylized images.
-Median filter is frequently used to eliminate noise from an image, and it can often preserve edges while reducing noise. However, the median filter is unable to eliminate the noise that we have applied, as shown in Fig. 18b-e, although across the image examples it outperforms the Gaussian filter. In some instances, the color or texture of the shapes has been altered slightly. As a result, the median filter is ineffective for denoising our approach.
-DnCNN [81] is a simple and fast pre-trained denoising convolutional neural network. It uses single-channel images as its input. To eliminate the noise, we have divided the noisy RGB image into three distinct color channels and applied DnCNN to each of them. The denoised RGB image is created by recombining the three denoised color channels. However, the noise we have applied is not eliminated either, as shown in Fig. 18b-e.
Following the denoising operation, we have compared the denoised images with the original stylized images (from Fig. 18). The SSIM index and mean square error (MSE) are computed for comparison, and the results are shown in Table 5. We have observed that the SSIM values are on the lower side, while the MSE values are on the higher side. It indicates that these noise reduction techniques are not effective at removing the noises applied to the stylized images of SMC.
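The denoise-then-compare procedure can be illustrated with a small NumPy sketch. The experiments in the text use MATLAB; the code below swaps in a plain 3 × 3 median filter applied per color channel (mirroring the channel-wise splitting described for DnCNN) and an MSE comparison. All names and sizes are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_channel(channel, size=3):
    """Median filter on one 2-D channel, using edge padding to keep the shape."""
    pad = size // 2
    padded = np.pad(channel, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1)).astype(channel.dtype)

def denoise_rgb(image, size=3):
    """Split into channels, denoise each one, and recombine."""
    channels = [median_filter_channel(image[..., c], size)
                for c in range(image.shape[-1])]
    return np.stack(channels, axis=-1)

def mse(a, b):
    """Mean square error between two images of the same shape."""
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.mean(diff ** 2))
```

With this sketch, an isolated noisy pixel is removed completely, whereas a line wider than the filter window survives and leaves a non-zero MSE against the clean image, which is consistent with the observation that structured noise resists simple filters.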

Post-denoising attack analysis
To gain further insight into the denoising schemes, we have performed another attack analysis on the denoised stylized images and style images, as described in Sect. 4.1. As shown in Table 3 (Type-III and Type-IV), we have selected 1950 denoised style images and 1950 denoised stylized images for the attack analysis after the denoising procedure. The training and test images are split in an 80:20 ratio, and each image is resized to 128 × 128 pixels. ResNet-50 and Inception-v3 are used as CNN backbones and are trained for 50 epochs with a mini-batch size of 8. The stochastic gradient descent (SGD) optimizer with a learning rate of 0.01 is used for training. The result clearly shows that a successful attack is very difficult if we apply user-defined noises to the images, although the noisy style and stylized images can still be easily recognized by human users, as described earlier.
After applying the noises, overall test accuracy decreases significantly for Type-III and IV schemes by 23% to 34% (Table 3). Indeed, this is a substantially lower success rate of an attack, whereas a higher success rate has been attained to break other schemes (like CAPTCHaStar [13] or Deep-CAPTCHA [48]). It demonstrates that with the mild use of standard noises in our SMC scheme, it is very hard to achieve a higher success rate on our threat model.

Denoising: a comparison with text-CAPTCHA
In SMC, stylized images are our main concern. In Fig. 18, we have applied various user-defined noises to the stylized image and tried to denoise it with some of the available denoising filters. However, we could not eliminate those noises, which means that denoising operations might not be applicable to SMC. In text-CAPTCHA, the main objective is to recognize the distorted text/object, whereas in SMC, a deep-learning attack needs to recognize or extract the style. We have created some text-CAPTCHA samples and applied random lines as noise, as shown in Fig. 19. After several image processing operations, denoising becomes marginally effective for recognizing the embedded text. Likewise, we have applied the same noises and denoising procedures to our stylized image. The resulting binary images could not represent the input style/pattern information. This shows that denoising operations that are effective for text-CAPTCHA are not always suitable for solving SMC. This observation is our rationale for developing SMC.

Object segmentation and detection attack
SMC does not ask the user to identify, localize, or detect foreground object(s), which is the approach mainly followed by other existing methods. Instead, we pose our challenge as matching the pattern from a highly blended stylized image. Hence, our method cannot be solved by object segmentation and detection tools. In addition, SMC is resilient to well-known image processing tasks such as boundary/edge detection, noise removal, and pixel-level segmentation. This is a major benefit of our proposal.

Random guess attack
The style-grid is a 3 × 3 image matrix, and three answers are essential to solve an SMC challenge, one per row for each session. Therefore, there are 3 possible cases of random guessing attacks.
-In the case of the global selection of three styles from all 9 in the grid, the probability of a random guessing attack is $(9 \times 8 \times 7)^{-1} = 0.00198 = 1.98 \times 10^{-3}$. Alternatively, the probability can also be computed using the general permutation formula, $1/{}^{n}P_{r} = (n-r)!/n!$ with $n = 9$ and $r = 3$.
-For the row-wise selection of one style (per row), the probability of a random guessing attack is $3^{-3} = 0.0370 = 3.70 \times 10^{-2}$.
-Now, consider that the size of the style-grid is 500 × 500 pixels, and each style pattern varies randomly within 100 × 100 to 160 × 160 pixels, with an average of 130 × 130 pixels. The probability of random guessing under a global selection strategy can be estimated from these spatial dimensions.
-In addition, the row-wise selection of a style from the grid is considered for probability estimation, using the same style-grid dimensions with an average row height of 160 pixels and a width of 500 pixels.
Thus, it is not easy to pass an SMC by random guessing.
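The two count-based probabilities above can be checked with a few lines of Python. The sketch assumes, as the figures in the text imply, that the global case counts ordered selections of three distinct styles out of nine, and that the row-wise case has three options in each of three rows; the function names are hypothetical.

```python
from fractions import Fraction
from math import perm

def global_guess_probability(grid_size=9, answers=3):
    """Chance of guessing an ordered selection of `answers` distinct styles."""
    return Fraction(1, perm(grid_size, answers))

def row_wise_guess_probability(rows=3, options_per_row=3):
    """Chance of guessing one correct style independently in every row."""
    return Fraction(1, options_per_row ** rows)

print(float(global_guess_probability()))    # 1/504
print(float(row_wise_guess_probability()))  # 1/27
```

Using exact fractions keeps the intermediate values free of rounding until they are converted for display.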

False accept rate (FAR)
The FAR represents the probability that a bot succeeds in solving SMC, whereas the false reject rate denotes the probability that a human is unsuccessful in solving it. An acceptable FAR bound of about 1.5% is suggested in [48]. The FAR depends on the number of possible answers $n_s$ (here, $n_s$ is the total number of query styles in SMC) and the number of challenges $q$ that a bot must solve successfully. It is defined as $\mathrm{FAR} = (n_s)^{-q}$. In [48], $n_s = 8$ and $q = 2$ have been considered, resulting in a FAR of 1.5625%. In SMC, $n_s = 9$ and $q = 3$, which gives a FAR of 0.137%, much lower than the 1.5% limit. In addition, considering $q = 2$, the FAR is 1.235%. Thus, SMC attains a better FAR than the others.
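The FAR values quoted above follow directly from the definition, as the short check below illustrates; the function name is hypothetical.

```python
def false_accept_rate(n_styles, n_challenges):
    """FAR = n_s ** (-q): chance that a bot guesses q challenges with n_s options each."""
    return n_styles ** (-n_challenges)

for n_s, q in [(8, 2), (9, 3), (9, 2)]:
    print(f"n_s={n_s}, q={q}: FAR = {100 * false_accept_rate(n_s, q):.4f}%")
```

Increasing either the number of query styles or the number of required answers lowers the FAR, at the cost of a longer challenge for human users.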

Usability study and result analysis
The human accuracy (%) and solving time (seconds) are computed for the performance evaluation of the users on SMC.
Accuracy: it is determined by the ratio of correct answers to solve SMC and the total number of responses.
Solving time: time (in seconds) taken to solve an SMC.
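As a small illustration of these two measures, the sketch below computes the accuracy and the mean solving time from a list of responses. The response records and function names are made up for the example and are not taken from the study data.

```python
from statistics import mean

def accuracy_percent(correct, total):
    """Accuracy (%) = correct responses / total responses * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total

def summarize(responses):
    """responses: list of (is_correct, seconds) pairs; returns (accuracy %, mean time)."""
    correct = sum(1 for ok, _ in responses if ok)
    return accuracy_percent(correct, len(responses)), mean(t for _, t in responses)

# Hypothetical responses, for illustration only.
sample = [(True, 5.0), (True, 7.0), (False, 9.0), (True, 3.0)]
print(summarize(sample))
```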

Experimental setup
Our SMC algorithm is implemented using PHP and MySQL and has been exhaustively tested on web browsers. To further analyze the effectiveness of the algorithm, we have conducted a usability study. For this purpose, we contacted students, faculty, and staff members of our department and asked them to volunteer for the usability study. To facilitate the study, we have installed XAMPP (Version 8.0.25) and SMC on 30 PCs. The volunteers are also present during the test sessions to monitor the proceedings. Student volunteers, faculty, and staff members of various departments from different institutions have participated in response and feedback collection by solving a set of random SMC queries. There are no businesses or sponsors involved in this research. The participants generously contributed their time and feedback. We have guided them through an illustrative session on solving SMC challenges and providing feedback.
The participants acknowledged that they understood our research objectives and gave their consent to participate before the response collection process began. It should be noted that there is no financial interest involved in this research. To ensure the users' privacy, their details are collected anonymously while maintaining ethical considerations. The participants agreed to share their overall experience during the response collection, which is invaluable to our research. We take great care to ensure that all privacy and ethical issues are addressed in our research efforts. Altogether, 152 persons (male: 85 and female: 67 users) from various age groups between 8 and 65 years have participated in the evaluation task. Table 6 provides information about their age group and gender. Figure 20 depicts users completing the SMC challenge on a PC and on smartphone/mobile devices. In both cases, the correct styles are selected and highlighted with a colored rectangle. After the responses are submitted correctly, the SMC is verified successfully. In addition, Fig. 20 also shows how the users submit their responses in the departmental laboratory. The user responses are collected from a local server in a structured datasheet for analysis. We have observed that many participants are already accustomed to using PCs and/or mobiles to solve CAPTCHAs on various websites. In addition, a few young children and elderly people lack any prior knowledge of how to solve a CAPTCHA. The solution procedure of an SMC challenge is briefly demonstrated to the participants by a group of 15 student volunteers who are involved in the response collection. Next, each participant is requested to solve three SMC challenges at three different sessions, i.e., 9 answers from each user. A total of 1368 (9 × 152 users) responses are recorded accordingly. The human performance in solving SMC is given in Tables 7 and 8. Lastly, the participants have provided their remarks through a feedback form. Following the collection of all responses and feedback, we generate a final datasheet for our analysis.
The accuracy used in the evaluation is computed as

$$\text{Accuracy (\%)} = \frac{\text{Correct Responses}}{\text{Total Number of Responses}} \times 100. \tag{15}$$
In addition, SMC can easily be deployed on mobile devices (SMC-App) for a brief usability study. The responses are provided by 30 participants in our department laboratory. We have conducted a similar user-friendliness and satisfaction survey to collect their feedback. We have observed that the participants easily understood SMC-App and solved it within a reasonable time. We conducted all of the studies while maintaining ethical concerns and without any financial objective. The details are described in Sect. 5.4.

Solving accuracy
The average human accuracy (%) in solving SMC (Eq. 15) is 95.6% (Table 7). It improves in successive sessions as the users become familiar with the solving technique, although there is a slight variation in accuracy between the sessions. It is interesting that humans take much less time to recognize an image, even a distorted one, than to type the text (6-8 characters) required to solve a text-CAPTCHA or to answer a cognitive question. Figure 22a displays the category-wise accuracy of all 152 participants in solving SMC in three different sessions. We observed that the accuracy is lower for category D, which is the group of aged people. The rest of the groups performed much better at solving SMC. The complete details of human accuracy in solving SMC are given in Table 7. In Table 9, we have compared the solving accuracy for SMC with an equal number of male and female participants (Fig. 21). We have observed that the solving accuracy of the female participants (96.19%) is better than that of the male participants (94.62%). We have also compared the performance of the individuals who are accustomed to solving CAPTCHAs with those who are not.
The results of these evaluations are given in Table 10. Those who are familiar with CAPTCHAs have an average solving accuracy of 96.53%, while those who are unfamiliar have an accuracy of 95.90%. Thus, familiarity with CAPTCHAs offers a slight advantage in solving accuracy, but the difference between the two groups is not particularly large. People who are unfamiliar with CAPTCHAs also perform quite well considering their lack of experience. This indicates that SMC challenges are not overly difficult to decipher, even for those with no prior knowledge, and that a basic understanding of CAPTCHAs is sufficient for most users to solve them. Furthermore, the majority of users are able to apply the knowledge they possess to solve the presented CAPTCHAs accurately.

Solving time
The participants' average solving time in seconds (s) to answer each SMC query over three different sessions is 6.59 s. In addition, Table 8 provides detailed timing results for the different categories of participants in solving SMC. Figure 22b shows a chart in which the average solving times of all 152 participants are plotted category-wise. It is observed that the participants' skills improve as they solve more SMC challenges at various experimental sessions. It is evident from Table 8 that category-A participants achieve the best timing result, solving SMC within 4.91 s during experiment-3 on session-1, which is the minimum.
In Table 9, we have compared the solving time for SMC with an equal number of male and female participants. The average solving times are 6.65 s for the male participants and 6.59 s for the female participants, a difference of only 0.06 s. We have also compared the average completion time of the individuals who are accustomed to solving CAPTCHAs with those who are not. Table 10 shows the results of this comparison. Those who are used to solving CAPTCHAs had an average completion time of 6.65 s, while those who were not were slightly slower at 7.65 s. This indicates that those with more experience in dealing with CAPTCHAs are able to solve them more efficiently. However, the fact that those who are not familiar with CAPTCHAs were still able to complete the task implies that SMC challenges are not overly complex and can be solved by anyone with a reasonable level of understanding. In addition, the results suggest that solving accuracy can be improved by allowing people to become more familiar with CAPTCHAs. By providing tutorials and other resources to help new users become accustomed to solving CAPTCHAs, it is possible to reduce the completion time while also increasing accuracy.
Thus, the solving time could effectively prevent the bots by imposing a time limit for responding to an SMC query. More visual explanation (e.g., histogram analysis and standard deviation on answering time taken by the participants) is given in Fig. 23.

Usability and feedback analysis on SMC-App
In addition to web-based usability testing via PC on SMC, we have conducted a rapid usability test with 30 users to assess the performance of our SMC-App. We have organized a comprehensive setup with 10 Android devices at our departmental laboratory. The SMC-App is installed on every mobile device and the volunteers have demonstrated the SMC-solving procedure through the SMC-App, i.e., how it works and how to solve it, to the participants. After a brief discussion, the participants submitted their responses three times at the allotted sessions. Each session is thoroughly monitored by our team of volunteers in order to ensure the accuracy of the results. In addition, the feedback from the participants is documented and analyzed to identify areas of improvement in the SMC-App. This enabled us to make the necessary changes and further optimize the usability, performance, and user experience of the SMC-App.
Our volunteers have conducted three sessions, during which they have collected a total of nine responses from each participant and received feedback on the SMC-App. After collecting all of the information, our volunteers have compiled it for further processing and analysis. Upon examination of the data and responses from the mobile devices, we have determined that the users' performance is highly satisfactory. Interestingly, better performance has been attained through the SMC-App than through SMC (web version) on PCs. Table 11 shows the performance of the users in solving SMC on their smartphones. The average solving time of the 30 users is 5.13 s, which shows that solving via the App is faster than PC-based testing. In addition, the average accuracy is calculated as 96.33%, which implies an improvement too. After response collection, the participants provided their experiential feedback. The users are asked to submit a score on a 10-point scale for each of the ten questions. Table 12 shows the mean score and standard deviation for each question of the survey on the SMC-App.

Feedback analysis
To assess ease of use and convenience, we asked each participant to provide feedback. After the answering sessions, all participants rated eight questions, each out of ten, where a higher score denotes a more favorable response. Table 12 shows a summary of the general questionnaire and the user responses regarding the SMC verification system. The questionnaire covers the system's interface, easiness, reliability, robustness, and solving time. The responses indicate that the participants found the SMC verification system satisfactory. A few of them are discussed here. The interface was rated at 9.30 out of 10, with a standard deviation of 0.61. The easiness of the system was rated at 8.73 out of 10, with a standard deviation of 0.62. The reliability of the system was rated at 9.21 out of 10, with a standard deviation of 0.98. Finally, the solving time of the system was rated at 8.84 out of 10, with a standard deviation of 1.12.
These results show that the SMC verification system is generally satisfactory. The participants found the interface, easiness, language independence, and solving time of the system to be of a high standard. Their responses also suggest that SMC relies on natural human behavior and can thereby distinguish humans from bots. Overall, the questionnaire and user responses indicate that the SMC verification system is highly usable and reliable, and that users perceive it as secure. This indicates that SMC could be a viable CAPTCHA system for websites and applications: it is expected to provide an effective security measure for online services and, owing to its ease of use, reliability, and fast solving time, a good user experience. Therefore, we are optimistic that the SMC verification system could serve many different applications.

Performance comparison with state-of-the-art methods
A comparison with state-of-the-art image-CAPTCHAs is presented in Table 13. Our method offers a better-balanced performance than those methods. Though the accuracy of ARTIFACIAL (99.7%) and HandCAPTCHA (98.6%) is higher than that of our SMC (96.6%), the response time and probability of attacks of SMC are significantly lower than those of these two methods. SACAPTCHA and Grid-CAPTCHA use NST, like SMC. However, their performances are lower than that of SMC. ARTIFACIAL and ASSIRA were attacked with success rates of 18% and 82.7%, respectively, in [84], whereas CAPTCHaStar was attacked with a success rate of 96% in [26]. Similar to our SMC, the SACAPTCHA scheme employs NST, and it was attacked with a success rate of 96% in [50]. In comparison with these schemes, our simulated attacks on SMC reach a success rate of 34.72%, which indicates that SMC is resilient to bots. Considering the overall performance, including the accuracy (96.6%), the response time (6.52 s), and the probability ($3.83 \times 10^{-4}$), our proposed SMC offers a significant improvement over the approaches listed in Table 13.

Structural similarity index (SSIM)
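An intuitive image quality metric, the Structural Similarity Index (SSIM), uses three attributes to quantify visual impact: brightness, contrast, and structure. These three attributes are multiplied to compute the overall index [76]. SSIM compares the content and style images with stylized images in SMC:

$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \cdot [c(x, y)]^{\beta} \cdot [s(x, y)]^{\gamma}$$

where $l$, $c$ and $s$ denote the brightness, contrast, and structure terms, and $\alpha > 0$, $\beta > 0$ and $\gamma > 0$ are parameters used to adjust the relative importance of the three components. With $\alpha = \beta = \gamma = 1$, the index takes the commonly used form:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + e_1)(2\sigma_{xy} + e_2)}{(\mu_x^2 + \mu_y^2 + e_1)(\sigma_x^2 + \sigma_y^2 + e_2)}$$

Here $\mu_x$ and $\mu_y$ are the means, $\sigma_x^2$ and $\sigma_y^2$ the variances, and $\sigma_{xy}$ the covariance of the two images, while $e_1$ and $e_2$ are small stabilizing constants. The SSIM ranges over [0, 1], where 1 means a perfect match between the reconstructed image and the original one, and 0 implies no match. Figure 25 illustrates the differences between the input content and reference stylized images using various metrics.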
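To make the index concrete, the following is a minimal sketch of a single-window (global) SSIM with all three exponents set to 1. It is an illustration under stated assumptions, not the implementation used in this work: the function name `global_ssim`, the flat-list image representation, and the constants `k1 = 0.01` and `k2 = 0.03` (conventional defaults in the SSIM literature) are all assumptions introduced here.

```python
from statistics import fmean


def global_ssim(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equal-length grayscale pixel sequences.

    Assumes alpha = beta = gamma = 1. k1, k2 and dynamic_range are
    conventional defaults, not values taken from the paper.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Small constants that stabilize the division when means/variances are near zero.
    e1 = (k1 * dynamic_range) ** 2
    e2 = (k2 * dynamic_range) ** 2

    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) ** 2 for a in x)
    var_y = fmean((b - mu_y) ** 2 for b in y)
    cov_xy = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))

    numerator = (2 * mu_x * mu_y + e1) * (2 * cov_xy + e2)
    denominator = (mu_x ** 2 + mu_y ** 2 + e1) * (var_x + var_y + e2)
    return numerator / denominator


if __name__ == "__main__":
    content = [0, 50, 100, 150, 200, 250]      # toy "content" pixels
    stylized = [10, 40, 120, 130, 220, 240]    # toy "stylized" pixels
    print(global_ssim(content, content))       # identical images give the maximum score
    print(global_ssim(content, stylized))
```

A local SSIM map would apply the same computation inside a sliding window around each pixel instead of once over the whole image.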
The SSIM map is a numeric array of real values with the same size as the input image that contains local values of the SSIM index. In the local SSIM map, small values appear as dark pixels and indicate the locations where the input image differs significantly from the reference image. Large values of local SSIM appear as bright pixels. Regions with large local SSIM values correspond to uniform regions of the reference images where NST has a small impact. In Fig. 25, the SSIM map is displayed for each comparison of the style and content images with the stylized image.
Likewise, we have determined the Mean Square Error (MSE) and Peak Signal to Noise Ratio (PSNR) between these images. Our objective is to show how much the content and style images are blended through NST. The SSIM value is about 0.5, which shows a clear difference between the images. It implies that even if an attacker has access to a stylized image database, it is very hard to find the content and style images. In [11], it is evident that the original content can be retrieved from a stylized image if the style information is available. However, it is tough to recover the input style from the stylized output, which remains a wide area to investigate.
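For reference, the MSE and PSNR mentioned in this section can be sketched as follows. This is a minimal illustration assuming 8-bit grayscale images flattened into equal-length sequences; the function names and the `peak` parameter are assumptions introduced here, not part of the original work.

```python
from math import inf, log10


def mse(x, y):
    """Mean squared error between two equal-length pixel sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(x, y)
    return inf if err == 0 else 10 * log10(peak ** 2 / err)


if __name__ == "__main__":
    content = [0, 50, 100, 150, 200, 250]      # toy pixel values
    stylized = [10, 40, 120, 130, 220, 240]
    print(mse(content, stylized), psnr(content, stylized))
```

A low PSNR (equivalently, a high MSE) between a style image and its stylized output indicates that the two differ strongly at the pixel level.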

2D normalized cross-correlation (NCC)
Normalized cross-correlation (NCC) is a template matching algorithm in computer vision. It is a common method for determining the degree of resemblance (or dissimilarity) between two images. The key benefit of NCC over traditional cross-correlation is that it is less susceptible to linear variations in light amplitude between two images. It is a simple method for matching two image patches, which may be used for feature identification or as part of more advanced algorithms [41]. For 2D images, template matching uses a reference image, which can be a sample of an original image or a synthesized prototype of the pattern for some other applications. The aim is to find whether, and where, the template (or a sufficiently similar pattern) occurs in the target image. Correlation coefficients are returned as a numeric matrix with values within the [−1, 1] range and are defined as

$$\mathrm{NCC}(u, v) = \frac{\sum_{x, y} \left[ I(x, y) - \bar{I}_{u,v} \right] \left[ t(\tilde{x}, \tilde{y}) - \bar{t} \right]}{\sqrt{\sum_{x, y} \left[ I(x, y) - \bar{I}_{u,v} \right]^2 \sum_{x, y} \left[ t(\tilde{x}, \tilde{y}) - \bar{t} \right]^2}}$$

where $\tilde{x}$ represents $(x - u)$ and $\tilde{y}$ represents $(y - v)$, $I$ is the image, $\bar{t}$ is the mean of the template $t$, and $\bar{I}_{u,v}$ is the mean of $I(x, y)$ in the region under the template.
We have compared a region of the style (template) with the stylized image (target) in Fig. 24. First, a style is converted to a grayscale image, and a random square region is cropped to check for any similarity with the target stylized image. It is observed that NCC cannot find a matching region in the stylized image. From Fig. 24d, it is observed that there is no distinct peak value, which shows that no significant similarity exists between the template and the target image. The NCC output for the template and target image is a matrix with values between −1 and +1. From this test, we have computed the maximum value as 0.3921 and the minimum value as −0.43 from the NCC matrix. These values are close to neither +1 nor −1, which signifies that no similarity is found between the template and target images. Thus, it is tough to reconstruct the style image from the stylized image after style transfer, as no similarity is detected between the style and stylized images.
Style image selection for the style-grid in SMC is a significant challenge. We have observed that a few style images are visually similar to one another, and the SSIM index is likewise high for them. In this case, it is also hard to determine the style images from the stylized images. Figure 26 compares the SSIM index of similar types of styles and their stylized representations. Some SSIM values are close to 0.7, which is significantly high and indicates that the style images are almost similar. This may result in a wrong answer to an SMC query.
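The correlation coefficient described above can be sketched in a few lines. The following is a minimal, unoptimized illustration assuming grayscale images stored as nested lists; the function name `ncc_map` is hypothetical, and unlike some library routines this sketch only evaluates offsets where the template lies fully inside the image (no padding).

```python
from math import sqrt


def ncc_map(image, template):
    """NCC coefficient for every offset (u, v) where the template fits inside the image.

    image and template are 2D lists of grayscale values (rows of pixels).
    Returns a 2D list indexed as result[v][u].
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])

    # Zero-mean template values and their energy are computed once.
    t_vals = [p for row in template for p in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_dev = [p - t_mean for p in t_vals]
    t_energy = sum(d * d for d in t_dev)

    result = []
    for v in range(img_h - tpl_h + 1):
        row_out = []
        for u in range(img_w - tpl_w + 1):
            # Region of the image currently under the template.
            patch = [image[v + j][u + i] for j in range(tpl_h) for i in range(tpl_w)]
            p_mean = sum(patch) / len(patch)
            p_dev = [p - p_mean for p in patch]
            denom = sqrt(sum(d * d for d in p_dev) * t_energy)
            num = sum(a * b for a, b in zip(p_dev, t_dev))
            # A flat patch or flat template has no defined correlation; report 0.
            row_out.append(num / denom if denom > 0 else 0.0)
        result.append(row_out)
    return result


if __name__ == "__main__":
    target = [
        [1, 2, 3, 4],
        [5, 9, 1, 8],
        [2, 7, 3, 6],
        [4, 4, 0, 5],
    ]
    patch = [[9, 1], [7, 3]]  # cropped from target at row 1, column 1
    for row in ncc_map(target, patch):
        print([round(c, 3) for c in row])
```

When the template is an exact crop of the target, the map peaks near +1 at the crop location; the absence of such a peak is what indicates that no region of the target resembles the template.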

Limitations and future work
In this work, SMC presents a more reliable, user-friendly, and secure image-CAPTCHA algorithm which addresses many challenges and implies an improvement over existing methods. However, currently, it bears a few limitations in various aspects, summarized below.
Architectural Design: SMC is adapted from elementary NST. Several recent works have addressed other issues to improve the robustness, computation time, model architecture, restoration, and other aspects of NST. Among these, an important issue is content leak [44], which is a major concern for security. However, as our target is to develop a novel image-CAPTCHA, we have not focused on these important and more sophisticated design goals in the proposed SMC.
Database: In a few cases, the output stylized images might be rendered with the input styles with a degree of visual similarities. It might confuse the users when selecting the most appropriate styles, causing a Style conflict. As a result, it might be difficult to select the correct styles when solving SMC. A similar style and related stylized images are illustrated in Fig. 27. It is clear that if we choose style images that contain similar colors or similar types of textures/patterns, it will be difficult for the users to select the correct style images by observing the content and stylized images. As a result, it may decrease the solving accuracy, or take more solving time. The chances of such conflicting cases will be mitigated with a larger style dataset size.
Currently, we have tested our scheme with a smaller dataset, containing only 1950 images, which is not sufficient for a rigorous deep-learning attack simulation and usability study (Table 3). Intuitively, the inclusion of more styles and stylized images in the dataset could reduce the success rate of attacks.
In addition, storing images in the database may cause serious security risks, since attackers could gain access to the database; this is an obvious challenge for most image-CAPTCHAs.
Usability: At present, our SMC is useful for persons with reasonable vision capability. SMC is not suitable for differently abled users with low vision. Moreover, the usability tests could be conducted with more diverse variations and a larger number of users in a future study.
Security: SMC can reduce attack rates considerably, as described earlier. It thwarts several diverse types of traditional as well as deep-learning attacks. The aforesaid limitations of SMC will be tackled by developing a more secure and robust CAPTCHA in the near future. For example, to guard against phishing and spoofing attacks, adding an extra security layer may be beneficial. A detailed usability study will also be performed with a large number of participants and their demographic information. Furthermore, a detailed attack analysis will be performed with a large database of style and stylized images. In addition, extraction of the original input style from the stylized images would be another direction of our study.

Conclusion
This paper proposes a Style Matching CAPTCHA, namely SMC, by adapting neural-style transfer based on deep neural networks. SMC generates a random challenge and asks the users to match it with the most appropriate style or pattern used in the stylization process along with a content image. Unlike other existing NST-based schemes, which ask the user to select a salient object (content) or area of interest, SMC asks the user to find the style image used in the stylization task by observing the semantic correlation between the content and the stylized output image. Our in-depth analysis demonstrates that SMC offers a good balance between usability and security for image-CAPTCHA design. SMC randomly generates a challenge that is easily recognizable by humans while remaining difficult for automated intelligent programs. Comprehensive security analysis implies that SMC can effectively thwart bot attacks with intelligent tools and deep learning models. Moreover, traditional image denoising-based attacks, which are generally effective for text-CAPTCHAs, are explored to analyze the strengths of the proposed SMC. To improve the security and robustness, we will explore the suitability of composite styles (i.e., blending of two different random styles) in our SMC algorithm. In addition, two-stage NST can be an alternative solution to baffle deep-learning attacks from various latent layers. Overall, it is a positive approach to enhance the security of image-CAPTCHA design in a new direction.

Fig. 27 The first two are similar style images. The next two are corresponding stylized images. Observing these two stylized images, it is difficult for the user to select the correct style.