Model optimization of English intelligent translation based on outlier detection and machine learning

Driven by the practical demands of international communication, intelligent English translation has become a key direction of artificial intelligence development in the English language field. For existing English intelligent translation systems, handling massive data effectively has always been a major problem, so it is necessary to apply the principles of machine learning to optimize the English intelligent translation model. The purpose of this paper is to optimize the existing English intelligent translation model through spectral clustering to remove outliers, making it better suited to massive data. Moreover, this paper uses deep learning methods to improve on the PoseNet network structure and adds regularization to the convolutional layers, which alleviates the vanishing-gradient problem and reduces computational complexity. In addition, this paper uses adaptive weighting to remove invalid model hypotheses. In the conceptual space of the similarity matrix, interior points lie farther from the origin than outliers, so the algorithm in this paper can detect and eliminate the outliers in a passage of English; at the same time, normal English content is classified into data categories, and the final translation result is obtained. Experimental tests show that the model proposed in this paper performs well and can cope with massive data, so it has certain advantages.


Introduction
As exchanges between countries become closer, communication between Chinese and English becomes more frequent. However, most people are not able to communicate fluently in English [1]. The traditional method of identification through an English dictionary consumes a lot of time, while simultaneous interpretation or hiring a translator adds considerable cost, so neither is a cost-effective means of communication [2]. English machine translation, however, can establish connections between existing Chinese and English through machine learning and convert between them in context, so that people can translate Chinese and English into each other [3] and communicate more easily. At present, machine translation has developed well in many fields. Every country has its own translation software, and many Internet enterprises have built their own free language translation systems [4].
In general, English machine translation is an interdisciplinary subject: it involves not only computer science but also the principles of linguistics, and it conforms to a certain mathematical logic. Promoted by many Internet enterprises, English machine translation, with its low cost, high speed, and ease of use, has gradually become the first choice for English-Chinese communication [5]. However, different English translation software has different emphases, so each program has its strengths and weaknesses, and different software produces different translations. Because artificial intelligence technology is still incompletely developed, English machine translation software often cannot recognize wider context, so translation results contain a large number of grammatical errors; such results can only serve as an auxiliary means of helping people understand English content, not as a written communication method in their own right [6]. The results of machine translation need further manual post-processing, which increases the difficulty of English translation [7]. Therefore, effective means must be chosen to improve the accuracy of English machine translation. This paper studies this problem.

Related Work
The basic principle of machine translation is, simply put, a process of data comparison. A machine translation system contains a database storing certain words, sentences, and phrases, and these entries have established relationships with Chinese content. When a piece of English is input, the database performs a quick data comparison, completes the connection between Chinese and English, and outputs the Chinese. However, in this process only data classification and extraction are carried out; the input English is not further analyzed [8][9]. The literature proposed a neural network language model [10]. Artificial neural networks thereby officially took on the important task of natural language processing, and since then many problems related to natural language processing have been considered using neural network models. In earlier research on machine translation, traditional machine translation dominated, which is why many neural network models played only a supporting role in statistical machine translation, improving its effect [11].
In view of the many problems in current machine translation, evaluating translation quality requires language experts to compare the Chinese and English and explore the difference between human translation and machine translation [12]. A common method is manual evaluation, in which the translation is scored: the higher the quality of the translation, the higher the score.
However, manual evaluation has high cost and low efficiency, so automatic evaluation methods have been proposed [13]. Over the past decade or so, researchers have proposed many automatic evaluation metrics for judging translation quality, among which BLEU is currently the most widely used. The BLEU metric is simple and reliable and is used by various machine translation evaluation organizations as the official metric of translation quality. It compares the output of machine translation with reference translations produced by experts: the closer the two are, the higher the score [14].
Machine translation is already applied very widely, but what we care about most is its translation quality. In layman's terms, when the input text is standardized and the structure is short, the result of machine translation can basically reach an understandable level [15]. When the input text is relatively long and has a complicated structure, the word order of the translation becomes confused and certain words are translated incorrectly. When the input is highly colloquial or contains irregular Internet slang, the machine translation result is unsatisfactory and can hardly express the meaning of the original text. At present, the quality of machine translation keeps improving, but compared with the level of professional translators, machine translation still needs further improvement, and many problems require further study. The road ahead for machine translation is therefore still long [16].

Introduction to the Model Fitting Method
The model fitting method based on consistency analysis appeared relatively early; its most representative algorithm is RANSAC, together with a series of improved algorithms based on it. RANSAC estimates model parameters in an iterative manner: it randomly samples a minimal data subset from the observation data, fits a model hypothesis to it, and tests the remaining data against that hypothesis. If a data point obeys the model hypothesis, that point is considered an interior point (inlier) of the hypothesis.
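The iterative procedure just described can be illustrated with a minimal sketch for 2-D line fitting (the function and parameter names here are illustrative, not the paper's implementation):

```python
import random

def ransac_line(points, n_iters=200, threshold=0.05, rng=None):
    """Minimal RANSAC sketch for 2-D line fitting.

    points: list of (x, y) tuples.  Returns the line (a, b, c) with
    a*x + b*y + c = 0 that has the largest consensus set, plus its inliers.
    """
    rng = rng or random.Random(0)
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        # 1. Randomly sample the minimal subset (two points define a line).
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        a, b = y2 - y1, x1 - x2
        c = -(a * x1 + b * y1)
        norm = (a * a + b * b) ** 0.5
        if norm == 0:
            continue  # degenerate sample (duplicate point)
        # 2. A point is an inlier if its distance to the hypothesis is small.
        inliers = [p for p in points
                   if abs(a * p[0] + b * p[1] + c) / norm < threshold]
        # 3. Keep the hypothesis with the largest consensus set.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b, c), inliers
    return best_model, best_inliers
```

On data containing ten collinear points and two gross outliers, the recovered consensus set contains the collinear points while the outliers fall outside the threshold.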
We focus on detailing the comparison methods based on preference analysis, of which the representative ones are J-Linkage, T-Linkage, and KF. J-Linkage: (1) Random sampling. The minimal sample set is constructed by selecting neighboring points, which have a higher probability of belonging to the same structure. In other words, if a point x_i has already been selected, the probability that a point x_j is selected next is

P(x_j | x_i) = (1/Z) exp(-||x_j - x_i||^2 / sigma^2),    (1)

where Z is a normalization constant and sigma^2 is the variance. Figure 1 shows the result of straight-line clustering based on the preference matrix.
After sampling, the preference set of each point is calculated, that is, the set of models whose point-to-model distance is less than a certain threshold (as in RANSAC). The number M of minimal sample sets (MSS) is related to the percentage of outliers: with fewer outliers, the probability of obtaining an outlier-free MSS is greater. Let S be the number of interior points of a given model and N the total number of input data. In the J-Linkage algorithm, the sampling probability of the first point is uniform, so the probability that the first point is an inlier is S/N; the localized sampling of Eq. (1), governed by the average distance between interior points and outliers, then raises the probability that the remaining points of the MSS are inliers of the same structure.

Let epsilon = S/N denote the proportion of interior points of a given model; with uniform sampling, the probability that an MSS of size m contains no outliers is roughly

w = epsilon^m.    (2)

When the localized probability of Eq. (1) exceeds the uniform probability for same-structure points, this strategy makes it easier to obtain an outlier-free MSS. Finally, for a given model, to extract with confidence p at least one outlier-free MSS among the M samples, the minimum number of sample sets K is

K = log(1 - p) / log(1 - w).    (3)

Figure 2 shows how K varies with M under different values of epsilon.
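Equations (2)-(3) give the standard RANSAC-style sample count, which can be computed directly (the function name is illustrative):

```python
import math

def min_sample_sets(confidence, inlier_ratio, mss_size):
    """Minimum number K of minimal sample sets needed so that, with
    probability `confidence`, at least one set is outlier-free.
    Implements K = log(1 - p) / log(1 - epsilon^m), rounded up."""
    w = inlier_ratio ** mss_size  # chance a single MSS is all inliers, Eq. (2)
    return math.ceil(math.log(1 - confidence) / math.log(1 - w))
```

For example, for line fitting (m = 2) with half the data being inliers, 17 sample sets suffice for 99% confidence; the count grows quickly as the inlier ratio drops.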

(2) J-Linkage clustering
Through clustering of the data points in the conceptual space, estimated model instances can be obtained. The general clustering algorithm is bottom-up (agglomerative): after each scan, the two clusters with the smallest distance are merged, and there are many different ways to compute the distance between clusters. J-Linkage measures the distance between two elements (data points or data clusters) by the Jaccard distance between their preference sets: given two sets A and B, the Jaccard distance is

d_J(A, B) = (|A ∪ B| - |A ∩ B|) / |A ∪ B|.    (4)
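The Jaccard distance of Eq. (4) is straightforward to implement; this small sketch treats preference sets as Python sets:

```python
def jaccard_distance(A, B):
    """Jaccard distance between two preference sets, as used by J-Linkage:
    0 for identical sets, 1 for disjoint sets."""
    A, B = set(A), set(B)
    union = A | B
    if not union:
        return 0.0  # convention: two empty sets are identical
    return (len(union) - len(A & B)) / len(union)
```

Points belonging to the same structure prefer largely the same model hypotheses, so their Jaccard distance is small and agglomerative clustering merges them first.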

T-Linkage: (1) Sampling
In T-Linkage, a comparative experiment on sampling strategies was done, as shown in Fig. 3. Comparing local sampling with uniform sampling under an equal number of hypotheses, local sampling is more helpful for producing accurate results. Therefore, combining uniform sampling and local sampling can more efficiently select neighboring points and obtain a more accurate sample subset; such a sampling strategy not only uses local information but also explores the hypothesis space. Minimal sample sets (MSS) are obtained by sampling in the above manner, and M model hypotheses are generated, denoted h_1, ..., h_M. Then the preference of each data point for the model hypotheses is computed to construct the conceptual space.
(2) Concept space. After generating the M model hypotheses, a consensus set (CS) of each model instance is constructed. As in J-Linkage, this consensus set is the set of points with a small distance to the model (less than a certain threshold). The characteristic function of a data point relative to a model hypothesis h is then defined. The comparison results are shown in Fig. 3.
Under an equal number of hypotheses, Fig. 3 compares uniform and local sampling. The characteristic function of a data point x with respect to a hypothesis h is

phi(x, h) = 1 if d(x, h) < tau, and 0 otherwise,    (6)

where tau is the distance threshold: when data point x prefers a model hypothesis h, its characteristic function is 1. T-Linkage lets this function take values in the entire closed interval [0, 1] instead of {0, 1}, which extends the concept space from the set of binary characteristic vectors to [0, 1]^M (7) and allows a point's preference to be expressed more accurately, integrating more specific residual information. The preference function PF of a point is defined as

PF(x, h) = exp(-d(x, h) / tau) if d(x, h) < 5*tau, and 0 otherwise,    (8)

where tau is a constant. Note that when d(x, h) >= 5*tau the function is set to 0, because exp(-d/tau) is almost constant there (its value changes by no more than 0.7%).
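The truncated exponential preference of Eq. (8) is a one-liner; this sketch makes the 5*tau cutoff explicit:

```python
import math

def preference(dist, tau):
    """T-Linkage-style continuous preference of a point for a hypothesis:
    exp(-d/tau) when d < 5*tau, else 0.  Beyond 5*tau the exponential is
    below exp(-5), about 0.7%, so truncating it changes almost nothing."""
    return math.exp(-dist / tau) if dist < 5 * tau else 0.0
```

A point lying exactly on the model gets preference 1, and the preference decays smoothly with the residual instead of dropping abruptly at a hard threshold as in Eq. (6).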
Using the preference function defined above, the preference vector space of J-Linkage is changed from {0, 1}^M to [0, 1]^M, making the preference vectors of T-Linkage a continuous space. Such a preference function better represents the preference of a data point x for a hypothetical model h.

Clustering Characteristics
The point characteristics in the clustering process are as follows: 1. For each data subset, there is at least one model for which all points express positive preference (that is, the data points are interior points of that model hypothesis). 2. Two different data subsets cannot give positive preference to the same model. In summary, similar data subsets are clustered based on the Tanimoto distance. After clustering, outliers need to be removed, and finally the least-squares fitting method is used to estimate the model of each cluster. The norm of a point's preference vector reflects how strongly the point is supported by the sampled models: the higher the norm, the more model hypotheses the point is an interior point of, and vice versa. The histogram of the norms of the subspace vectors is shown in Fig. 4.
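The Tanimoto distance mentioned above generalizes the Jaccard distance to the continuous preference vectors of T-Linkage; a minimal sketch:

```python
def tanimoto_distance(p, q):
    """Tanimoto distance between two preference vectors in [0, 1]^M:
    d_T(p, q) = 1 - <p, q> / (||p||^2 + ||q||^2 - <p, q>).
    It is 0 for identical nonzero vectors and 1 for orthogonal ones,
    matching the Jaccard distance on binary vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm2 = sum(a * a for a in p) + sum(b * b for b in q)
    if norm2 == dot == 0:
        return 0.0  # convention for two all-zero vectors
    return 1.0 - dot / (norm2 - dot)
```

Two points with identical preferences are at distance 0 and are merged first, while points that never prefer the same hypothesis are at distance 1 and are never merged.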
(3) Generating multi-structure model data classes. After eliminating outliers, algorithms such as kernel PCA, spectral clustering, and nonlinear dimensionality reduction are used to analyze the remaining data points in the RKHS space. LMedS is used to estimate a model instance from each point cluster. The goal is to fit the data with as few structures as possible: point sets are merged sequentially, and after two point clusters are merged it is checked whether the combined set can still fit a single geometric model well. If the geometric model instance is still fitted well, the merge is kept and merging continues.
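The LMedS step can be sketched for the 2-D line case: instead of counting inliers as RANSAC does, it keeps the hypothesis minimizing the median of the squared residuals (names and parameters here are illustrative):

```python
import random
import statistics

def lmeds_line(points, n_iters=100, rng=None):
    """LMedS sketch for a 2-D line: among randomly sampled hypotheses,
    keep the one minimizing the median of squared point-to-line residuals.
    The median criterion is robust to up to roughly 50% outliers and
    needs no inlier threshold."""
    rng = rng or random.Random(0)
    best, best_med = None, float("inf")
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        a, b = y2 - y1, x1 - x2
        c = -(a * x1 + b * y1)
        norm2 = a * a + b * b
        if norm2 == 0:
            continue  # degenerate sample
        med = statistics.median(
            (a * x + b * y + c) ** 2 / norm2 for x, y in points)
        if med < best_med:
            best, best_med = (a, b, c), med
    return best, best_med
```

A small median residual after merging two clusters indicates that the combined set still fits one geometric model well, which is exactly the merge test described above.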

Model Construction And Performance Analysis
Figure 5 shows the basic logic of machine translation. As the figure shows, a machine translation system is divided into data support and the translation system. The data support is the machine translation database built by machine learning: through data mining, key words, grammar, and sentence structures are extracted from a large amount of English data, and through data cleaning, sorting, and induction, the database of the machine translation system is formed. Generally speaking, the database is the foundation of the machine translation system, and the accuracy of machine translation is related, to some extent, to the size of the database and the richness of its data. The other part is the translation system.
The system preprocesses the collected external English information, arranges and decomposes it into relevant vocabulary, grammar, and sentence structure, integrates and compares it with the content in the database, and finally obtains relatively accurate translation results. The relevant translation results are then fed back to continue improving the database.

Conclusion
The intelligent translation model faces certain challenges in corpus collection, and the size of the data also has an important impact on the performance of the translation model. On the other hand, how to optimize and adjust the training model to improve translation quality still needs study. Before constructing the similarity-matrix concept space of the data points, this paper uses adaptive weighting to remove invalid model hypotheses, which reduces the computational complexity of the subsequent removal of outliers. Secondly, in the conceptual space of the similarity matrix, interior points lie farther from the origin than outliers; according to this distance distribution, a spectral clustering method that can automatically determine the number of subspace categories is used to simultaneously remove outliers and generate model data classes. In addition, sampling continues on the interior points of the multi-structure model data obtained in the first pass, and the above steps are repeated to obtain a cleaner data subset, thereby improving accuracy. Finally, the performance of the translation model constructed in this paper is verified experimentally.

Figure 1: Result of straight-line clustering based on the preference matrix.