Tag-less Back-Translation

An effective method to generate a large number of parallel sentences for training improved neural machine translation (NMT) systems is the use of the back-translations of the target-side monolingual data. The standard back-translation method has been shown to be unable to efficiently utilize the available huge amount of existing monolingual data because of the inability of translation models to differentiate between the authentic and synthetic parallel data during training. Tagging, or using gates, has been used to enable translation models to distinguish between synthetic and authentic data, improving standard back-translation and also enabling the use of iterative back-translation on language pairs that underperformed using standard back-translation. In this work, we approach back-translation as a domain adaptation problem, eliminating the need for explicit tagging. In the approach -- \emph{tag-less back-translation} -- the synthetic and authentic parallel data are treated as out-of-domain and in-domain data respectively and, through pre-training and fine-tuning, the translation model is shown to be able to learn more efficiently from them during training. Experimental results have shown that the approach outperforms the standard and tagged back-translation approaches on low resource English-Vietnamese and English-German neural machine translation.


Introduction
Neural Machine Translation (NMT) [2,18,48] has been the state-of-the-art approach for machine translation in recent years [16,38], outperforming Phrase-Based Statistical Machine Translation [33] when high-quality parallel data between the languages is available in abundance [55]. Such training data is, however, scarce and expensive to compile for many language pairs. Recently, researchers have proposed methods that exploit the easier-to-obtain monolingual data of one or both of the languages to augment the available parallel data and improve the performance of the translation models. Such methods include integrating a language model [20], back-translation [45,22,19], forward translation [53] and dual learning [21]. The back-translation approach is simple and has been the most effective technique yet for NMT [16,22]. The method involves training a target-to-source (backward) model on the available authentic bitext. The backward model is then used to translate a large amount of monolingual sentences in the target language into synthetic source sentences, generating the synthetic parallel data. The authentic and synthetic parallel data are then mixed to train a source-to-target (forward) model.
It has been shown that as the amount of monolingual data used in back-translation continues to increase, a point is reached where the model stops learning useful representations and its performance starts to drop. This is because the usually noisy synthetic data starts to overwhelm the authentic data, and the model starts to unlearn the correct parameters it learned from the authentic training data [17]. Extensive studies by [16] have shown that in low resource NMT, noising beam search outputs improves the models more than other generation methods such as sampling. The authors claimed that the method enhances source-side diversity. The works of [6,52], however, found that the noising technique is only a form of tagging: it indicates to the model that the noised data is back-translated, enabling it to treat the synthetic data as belonging to a different domain. The model then learns different representations from the two types of data. They, instead, introduced the use of explicit tags (and gates) to indicate synthetic inputs. The tagging approach was shown to outperform standard back-translation.
In this work, we approach back-translation as a domain adaptation problem, simplifying the works of [16,6,52] that explicitly differentiate between the two types of data using noise, tags or gates. Instead of tagging the synthetic data, our approach (tag-less back-translation) aims to enable the model to learn efficiently from the two types of data through pre-training and fine-tuning. Instead of relying on the model to differentiate between the data, we use the synthetic data as the generic (out-of-domain) data and pre-train the model on it. We then use the authentic data as the in-domain data to fine-tune the pre-trained translation model. We hypothesize that although the tagging and noising approaches improve the forward models, our domain-adaptation-tailored approach will provide a flexible method of maximizing the gains from the quantity of the synthetic data while efficiently utilizing the quality of the authentic parallel data. The approach also enables the use of different training settings on the different data, as is common in domain adaptation strategies. It can also benefit as improved domain adaptation methods are proposed.
In domain adaptation, the generic model is not always expected to perform very well in the domain in which it is to be deployed, hence the model is fine-tuned on a usually smaller but in-domain dataset. In many languages, the in-domain data is scarce or non-existent, which is the same issue faced in low resource neural machine translation. The in-domain data by itself is not sufficient to create a good model, while a model trained on the more abundant out-of-domain data performs poorly when deployed in the target domain. Mixing the two causes the in-domain data to be lost in the out-of-domain data, and the resulting model also fails to perform well in the target domain. The larger out-of-domain data is, therefore, used to pre-train a model, and the weights of this model are used to initialize the training of the in-domain translation model, a technique referred to as fine-tuning [11]. When a different language pair is used for pre-training than that used during fine-tuning, the approach is referred to as transfer learning [55,28].
We make the following contributions in this paper:
• we propose a novel approach that enables a translation model trained on synthetic and authentic parallel data to learn efficiently from the two types of data, utilizing the different advantages presented by each;
• we successfully apply pre-training and fine-tuning to enable the forward model in back-translation to differentiate between synthetic and authentic data during training, achieving superior performance to the standard and the successful tagged back-translation approaches;
• we show experimentally that the approach is superior to the standard and tagged back-translation approaches in low resource English-Vietnamese and English-German neural machine translation systems.
The remaining sections are as follows: Section 2 reviews relevant literature on NMT, on leveraging monolingual data in NMT, and on pre-training and fine-tuning. Section 3 explains the tag-less back-translation approach. Section 4 describes the data and experimental set-up used in training the models. Section 5 presents the results obtained from the experiments, and Section 6 discusses the findings further. Finally, Section 7 concludes the work and suggests future directions.

Related Works
This section presents prior work on NMT, back-translation and pre-training in NMT.

Neural Machine Translation (NMT)
NMT is based on a sequence-to-sequence encoder-decoder system with an attention mechanism [2,47,34]. The encoder and decoder are neural networks that model the conditional probability of a target sentence y given the source sentence x: p(y|x). The encoder converts the input in the source language into a set of vectors, while the decoder converts the set of vectors into the target language through an attention mechanism, one word at a time. The attention mechanism was introduced to keep track of context in longer sentences [2].
The NMT model produces the translation by generating one target word at every time step. Given an input sequence $X = (x_1, \ldots, x_{T_x})$ and the previously translated words $(y_1, \ldots, y_{i-1})$, the probability of the next word $y_i$ is

\[ p(y_i \mid y_1, \ldots, y_{i-1}, X) = g(y_{i-1}, s_i, c_i) \]

where $s_i$ is the decoder hidden state for time step $i$ and is computed as

\[ s_i = f(s_{i-1}, y_{i-1}, c_i) \]

Here, $f$ and $g$ are nonlinear transform functions, which can be implemented as a long short-term memory (LSTM) network [23] or gated recurrent units (GRU) [9] in recurrent neural machine translation (RNMT), and $c_i$ is a distinct context vector at time step $i$, which is calculated as a weighted sum of the input annotations $h_j$:

\[ c_i = \sum_{j=1}^{T_x} a_{i,j} h_j \]

where $h_j$ is the annotation of $x_j$ calculated by a bidirectional Recurrent Neural Network. The weight $a_{i,j}$ for $h_j$ is calculated as

\[ a_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T_x} \exp(e_{i,k})} \]

and

\[ e_{i,j} = v_a^{\top} \tanh(W s_{i-1} + U h_j) \]

where $v_a$ is the weight vector, and $W$ and $U$ are the weight matrices. All of the parameters in the NMT model, represented as $\theta$, are optimized to maximize the following conditional log-likelihood of the $M$ sentence-aligned bilingual samples:

\[ L(\theta) = \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{T_y} \log p(y_i^m \mid y_{<i}^m, X^m, \theta) \]

To remove the recurrence and enable parallelization across multiple GPUs during training, convolutional neural networks were used to create the convolutional NMT (CNMT) encoder-decoder architecture [18,51]. The CNMT utilizes 1-dimensional convolutional layers followed by gated linear units (GLU) [14]. The decoders compute and apply attention to each of the layers. The model uses positional embeddings along with residual connections [18].
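The attention weighting described in this section can be illustrated with a small pure-Python sketch. It assumes the additive scoring form of [2], in which the score for annotation h_j is v_a · tanh(W s_{i-1} + U h_j); the function name `additive_attention` and the list-based matrix representation are illustrative choices, not part of the original work.

```python
import math


def additive_attention(prev_state, annotations, W, U, v):
    """Return (weights, context) for one decoder time step.

    prev_state:  decoder state s_{i-1} as a list of floats
    annotations: encoder annotations h_1..h_Tx, each a list of floats
    W, U:        weight matrices given as lists of rows (assumed shapes
                 d_a x d_s and d_a x d_h)
    v:           weight vector of length d_a
    """
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    projected_state = matvec(W, prev_state)

    # Unnormalised alignment scores e_{i,j}, one per source position.
    scores = []
    for h in annotations:
        projected_h = matvec(U, h)
        scores.append(sum(vk * math.tanh(a + b)
                          for vk, a, b in zip(v, projected_state, projected_h)))

    # Softmax over source positions gives the weights a_{i,j}.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector c_i is the weighted sum of the annotations.
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations))
               for d in range(dim)]
    return weights, context
```

The weights always sum to one, so the context vector is a convex combination of the encoder annotations.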
The transformer [48,15] architecture was introduced to remove the recurrence and convolutions of previous architectures. The transformer is based solely on multi-headed self-attention layers. It enables parallelization across multiple GPUs, thereby reducing training time. The architecture is used in current state-of-the-art translation systems [16,38].
In this work, we used a unidirectional LSTM encoder-decoder architecture with Luong attention [34]. This is a simple recurrent (RNMT) architecture. Our approach is not architecture-dependent and can be applied to the other architectures or to more enhanced implementations of the RNMT.

Leveraging Monolingual Data for NMT
The use of monolingual data of the target and/or source language has been studied extensively to improve the performance of neural translation models, especially in low resource settings. [20] explored integrating language models trained on monolingual data into NMT systems; [12,5] proposed using a copy, or a slightly modified copy, respectively, of the target data as the source; [45] proposed the back-translation approach; [53] proposed forward translation; and [21] used both forward and back-translations to improve the translation models. The back-translation approach has been shown to outperform other approaches in low and high resource languages [16,22].
Various studies have investigated back-translation with the aim of improving the backward model, selecting the most suitable generation/decoding methods for generating the synthetic data, and reducing the impact of a higher ratio of synthetic to authentic bitext. The quality of the models trained using back-translation depends on the quality of the backward model [16,17,22,5,19,29,52]. To improve the quality of the synthetic parallel data, [22] used iterative back-translation, iteratively using the back-translated data to improve both the backward and forward models. [29] and [13] used high resource languages through transfer learning, and [54] explored the use of both target and source monolingual data to improve both the backward and forward models.
[37] trained a bilingual system based on [25] to do both forward and backward translations, eliminating the need for two separate models. [41] used transductive data selection methods to select monolingual data that are in the same domain as the test set for back-translation, improving performance.
The works of [17,42] have found that the ratio of synthetic to authentic data is the factor that most affects the performance of the models. When the ratio is high, the model tends to learn more from the synthetic data, which contains more mistakes than the authentic data. Investigations have found that generating the synthetic data by sampling, or by adding noise to the beam search output, outperforms the regular beam decoding technique [16,24]. These approaches were said to improve the models by enhancing source-side diversity. [6] claimed, instead, that the noise only indicates to the model whether the input is synthetic or authentic, enabling the model to better utilize the two types of data. [52] and [6] used tags (and gates) to enable the model to distinguish between the data, and the approach has been shown to utilize more synthetic data efficiently, outperforming standard back-translation and enhancing the efficiency of iterative back-translation.

Domain Adaptation
Domain adaptation is the use of a usually small amount of in-domain data to improve the performance of an out-of-domain (general purpose) model before deployment. The amount of in-domain parallel data is usually not sufficient to train a very good model, and general purpose models usually perform poorly in the target domain [32]. There are two categories of domain adaptation, data centric and model centric [11], each having several techniques. The techniques in these categories include using monolingual data [20], synthetic data generation [45], data selection [49], using tagged out-of-domain parallel data [10] and fine-tuning [45].
Pre-training has been used successfully in various machine learning tasks to improve performance when the data is not enough to train a good enough model. It was used for training word embeddings [35], in computer vision [50], for fine-tuning NMT models [16] and as transfer learning in low resource NMT [55,28]. Transfer learning for machine translation involves training a model on a high resource language pair and transferring what was learned to a low resource pair. The works of [55,36,28] have shown tremendous improvements over models that are trained on the low resource data from scratch.
In back-translation, [45] showed that fine-tuning a pre-trained model on in-domain data improves the quality of the back-translation model. [43] pre-trained the model on the authentic data and fine-tuned it on the mixed synthetic and authentic data. [29] and [13] pre-trained a model on a high resource language pair and fine-tuned it on a low resource language pair.

The Proposed Method
The approach is shown in Fig. 1. As illustrated in Algorithm 1, the authentic parallel data is first used to train a backward model, which then translates the target-side monolingual data to generate the synthetic parallel data. Instead of mixing the two types of data to train a forward (target) model, we used only the synthetic data to pre-train the forward model, M x→y, until no improvement is observed on the development set. Finally, the forward model is fine-tuned on the authentic data.
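The steps of the approach can be sketched as a short Python function. This is only an outline under stated assumptions: `train` and `translate` are hypothetical callables standing in for whatever NMT toolkit is used, and the `init` keyword is an assumed way of passing pre-trained weights; none of these names come from the original work.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def tagless_back_translation(
    authentic: List[Pair],
    monolingual_target: List[str],
    train: Callable[..., object],
    translate: Callable[[object, List[str]], List[str]],
):
    """Pre-train on synthetic data, then fine-tune on authentic data."""
    # 1. Backward (target-to-source) model trained on the authentic bitext.
    backward = train([(tgt, src) for src, tgt in authentic], init=None)

    # 2. Back-translate the target-side monolingual data into synthetic sources.
    synthetic_sources = translate(backward, monolingual_target)
    synthetic = list(zip(synthetic_sources, monolingual_target))

    # 3. Pre-train the forward model on the synthetic data only (out-of-domain).
    pretrained = train(synthetic, init=None)

    # 4. Fine-tune the pre-trained forward model on the authentic data (in-domain).
    return train(authentic, init=pretrained)
```

Unlike standard and tagged back-translation, the synthetic and authentic pairs never appear in the same training set here; they are used in two consecutive training stages.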
It was shown in [29] that using a different vocabulary for pre-training and for fine-tuning leads to a drop in performance because, it was argued, independent vocabularies use different identifiers even for the same subwords, and the network loses the benefits of the weights learned during pre-training. The authors proposed learning a joint BPE on a mixture of both the pre-training and fine-tuning data, and this has been shown to achieve better results in domain adaptation. In our approach, we have access to both the out-of-domain (synthetic) and the in-domain (authentic) parallel data. This, therefore, enables us to learn a joint BPE and build a single training vocabulary for both pre-training and fine-tuning.
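The identifier mismatch can be made concrete with a toy example. The sketch below does not learn BPE; it only builds vocabularies from already-segmented text (the `@@` continuation marker and the sentences are invented for illustration) to show that independently built vocabularies can assign different ids to the same subword, whereas a jointly built one cannot.

```python
def build_vocab(corpus):
    """Map each whitespace-separated subword to an id in order of first appearance."""
    vocab = {}
    for sentence in corpus:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab


# Toy, already-segmented corpora (hypothetical data).
synthetic = ["the ca@@ t sat", "a do@@ g ran"]
authentic = ["a ca@@ t ran", "the bir@@ d sat"]

# Independent vocabularies: one for pre-training, one for fine-tuning.
separate_pre = build_vocab(synthetic)
separate_fine = build_vocab(authentic)

# Joint vocabulary built once on the mixture and reused in both stages.
joint = build_vocab(synthetic + authentic)

# Subwords present in both independent vocabularies but with different ids.
mismatched = [t for t in separate_pre
              if t in separate_fine and separate_pre[t] != separate_fine[t]]
print(mismatched)
```

With independent vocabularies, the embedding row learned for a subword during pre-training may be read as a different subword during fine-tuning; a single joint vocabulary avoids this by construction.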

Set-up
We used the TensorFlow [1] implementation of the OpenNMT [27] framework to train the models, with the set-up based on the NMTSmallV1 configuration. Specifically, the configuration is a 2-layer unidirectional LSTM encoder-decoder model with Luong attention [34]. It has 512 hidden units and a vocabulary size of 50,000 for both source and target languages. We used the Adam [26] optimizer and a batch size of 64 with a dropout probability of 0.3 and a static learning rate of 0.0002, and the models are evaluated on the development set after every 5,000 training steps. The models were evaluated using the bilingual evaluation understudy metric, BLEU [39], specifically the multi-bleu [31] implementation. The models are trained until there is no improvement of over 0.2 BLEU after four consecutive evaluations. As stated in Section 3, the learning of BPE on the training data and the building of the training vocabulary for both pre-training and fine-tuning were done on the mixture of the synthetic and authentic parallel data. During fine-tuning, we, therefore, only change the training data.
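One way to read the stopping condition is sketched below. This is an interpretation, not the toolkit's code: it assumes that training stops when the best development BLEU over the most recent four evaluations does not exceed the best earlier score by more than 0.2. The function name `should_stop` and its parameters are illustrative.

```python
def should_stop(bleu_history, min_improvement=0.2, patience=4):
    """Decide whether to stop, given dev-set BLEU at each evaluation.

    bleu_history: scores in evaluation order (one per 5,000 training steps
                  in the set-up described above).
    """
    if len(bleu_history) <= patience:
        return False  # not enough evaluations to judge yet
    reference = max(bleu_history[:-patience])
    recent_best = max(bleu_history[-patience:])
    return recent_best - reference <= min_improvement
```

For example, a history that plateaus around 20 BLEU for four evaluations triggers the stop, while one that keeps gaining a point per evaluation does not.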

Data
For this work, we use the preprocessed low resource English-Vietnamese parallel data [34] of the IWSLT 2015 Translation task [7]. We used the 2012 and 2013 test sets for development and testing respectively. We also used the data from the IWSLT 2014 German-English shared translation task [8] as the second language pair, pre-processed using the data clean-up as well as the train, development and test split in [44]. For the monolingual data, we used the preprocessed English monolingual data of the WMT 2014 English-German translation task [4]. We shuffled the monolingual data and selected 666,585 monolingual sentences, which is five times the size of the En-Vi parallel data. The statistics of the datasets are shown in Table 1. We learned byte pair encoding (BPE) [46] with 10,000 merge operations on the training dataset and applied it to the train, development and test datasets. Afterwards, we built the vocabulary on the training dataset. For all the experiments, we used three times as much monolingual data as the available parallel data for both language pairs, except when we experimented with the ratio of 1:5 (parallel to monolingual data) for the English-Vietnamese NMT.
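The selection of monolingual data at a given parallel-to-monolingual ratio can be sketched as follows. The function name `select_monolingual` and the fixed seed are illustrative assumptions; the only figure taken from the text is that 666,585 sentences correspond to five times the En-Vi parallel data, which implies 133,317 parallel sentence pairs.

```python
import random


def select_monolingual(monolingual, n_parallel, ratio, seed=0):
    """Shuffle the monolingual sentences and keep ratio * n_parallel of them."""
    rng = random.Random(seed)
    shuffled = list(monolingual)
    rng.shuffle(shuffled)
    return shuffled[: n_parallel * ratio]


# 666,585 selected sentences at a 1:5 ratio imply 133,317 parallel pairs.
assert 133_317 * 5 == 666_585
```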

Models
To compare the performance of our approach with that of the previous works, we implemented the following methods to train translation models for English-Vietnamese and English-German NMT using the data presented in Section 4.2 above. All models were trained using the same settings stated in Section 4.1.
• We first train baseline models using only the available authentic parallel data. The baselines have English as the target language: Vi-En and De-En.
• We then train the backward models, also on the authentic parallel data, using English as the source language: En-Vi and En-De. These models are used to generate the additional synthetic parallel data for the back-translation approaches.
• We implemented the various back-translation strategies, namely standard back-translation (standard bt), tagged back-translation (tagged bt) and tag-less back-translation (tag-less bt (joint BPE)), using the authentic and synthetic parallel data.

Results
All scores reported are statistically significant with p < 0.05. We used the paired bootstrap resampling of [30] as implemented in [40] to estimate the statistical significance confidence scores. See Table 13 in Appendix 1 for confidence scores.
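The general idea of paired bootstrap resampling can be sketched as below. This is a generic illustration, not the implementation in [40]: the `metric` callable and the toy `exact_match_rate` are assumptions standing in for corpus-level BLEU, and the function names are hypothetical.

```python
import random


def paired_bootstrap(hyps_a, hyps_b, refs, metric, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A scores above system B.

    The same resampled sentence indices are used for both systems, which is
    what makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sample_refs)
        score_b = metric([hyps_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples


def exact_match_rate(hyps, refs):
    """Toy corpus-level metric used only for illustration (not BLEU)."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)
```

If system A wins on at least 95% of the resampled sets, the difference is commonly reported as significant at p < 0.05.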

English-Vietnamese Low Resource NMT
Table 2 shows the evaluation scores of the best single-checkpoint models and of the improved models obtained by averaging the last 8 checkpoints. We first created a forward Vietnamese-English (Vi-En) model, the baseline, on the available authentic parallel data. The model trained for 75,000 steps before the stopping condition was met. The baseline model was trained further to 110,000 training steps, but the performance continued to flatten without any observable improvement. Its best single-checkpoint score is reported in Table 2.
We mixed the two types of data, synthetic and authentic, without differentiating between them and used the resulting large dataset to train a forward model. We labelled this model standard bt, for standard back-translation [45]. This model was trained for 165,000 steps before the stopping condition was met. It achieved a single-checkpoint best BLEU score of 24.46 at the 155,000th training step.
We mixed the synthetic and authentic parallel data, learned a joint BPE on the resulting training dataset and built the training vocabulary. We applied the BPE to the synthetic data for pre-training and to the authentic data for fine-tuning. We trained a model, labelled tag-less bt (joint BPE), using this approach. The model achieved a single-checkpoint score of 18.60 BLEU during pre-training and improved to 26.53 BLEU after fine-tuning. The average fine-tuned model was better by about 0.30 BLEU. The average pre-trained model performed poorly compared to the baseline and the standard back-translation models: 18.59 BLEU vs 22.22 and 25.28 BLEU respectively. This is because the quality of the data used in training the model, the synthetic data, is lower than that of the data used for the other two. The synthetic data, although generated by a reasonably good backward model, is still not of sufficient quality to train a model that can compare to models trained in whole or in part on the authentic data. Fine-tuning the model on the authentic data results in a sharp rise in performance. The model was fine-tuned until the stopping condition was met. The approach outperformed the baseline and standard back-translation models by 5.34 and 2.07 BLEU respectively. The gap in performance was, however, reduced to 4.61 and 1.55 BLEU after checkpoint averaging.
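Checkpoint averaging, as used for the averaged scores above, takes the element-wise mean of the parameters of the last few saved checkpoints. A minimal sketch follows, assuming each checkpoint is represented as a dictionary from parameter name to a flat list of floats; this representation and the function name `average_checkpoints` are illustrative, not the toolkit's API.

```python
def average_checkpoints(checkpoints, last_n=8):
    """Element-wise mean of each parameter over the last `last_n` checkpoints."""
    selected = checkpoints[-last_n:]
    averaged = {}
    for name in selected[0]:
        # Group the values at the same position across the selected checkpoints.
        columns = zip(*(ckpt[name] for ckpt in selected))
        averaged[name] = [sum(col) / len(selected) for col in columns]
    return averaged
```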
We also experimented with the other pre-train and fine-tune variant, learning the BPE only on the synthetic data. We built the vocabulary on the synthetic training data after applying the BPE. The synthetic corpus was used to pre-train a forward model for 130,000 steps, achieving a single-checkpoint best score of 17.85 BLEU. The authentic parallel data was then used to fine-tune the model for a further 35,000 training steps. Stopping at each of these stages was based on the stopping condition. The performance of the tag-less bt model improved to 25.16 BLEU after fine-tuning. Although this approach was shown to outperform the baseline and standard back-translation, it underperformed the joint BPE implementation of the tag-less approach by 1.06 BLEU. In Figures 2a and 2b, we show how the BLEU scores continue to improve as the number of training steps increases. The model trained using the tag-less bt (joint BPE) approach continued to outperform the three others after fine-tuning.

English-German Low Resource NMT
We conducted the same set of experiments presented in Section 5.1 on the second low resource dataset, the English-German IWSLT'14 parallel dataset. This data, as presented in Table 1, is made up of a little more than 150,000 parallel sentences. We first trained a backward (En-De) model on the available parallel data. Under our set-up, this model reached its maximum performance on the test set at 10.25 BLEU after averaging the last 8 checkpoints. It stopped training at the 80,000th training step, having achieved its best single-model performance, 10.03 BLEU, at the 65,000th step. We used the average model to generate the synthetic data by translating the available English monolingual data. We trained four separate forward (De-En) models based on the approaches explained earlier: the baseline, trained on the available authentic data; standard bt, trained on the mixture of the authentic and synthetic data without differentiating between them; tag-less bt, pre-trained on the synthetic data and fine-tuned on the authentic data, having learned the BPE on the synthetic data and updated the vocabulary before fine-tuning; and, finally, tag-less bt (joint BPE), also trained using the tag-less approach but having learned the BPE and built the vocabulary on the mixture of the synthetic and authentic data.
The results of evaluating the models trained using the various approaches are presented in Table 2. The baseline achieved a modest average performance of 20.95 BLEU after training for 100,000 training steps. The performance on the dataset improved after applying standard back-translation, which achieved a large +4.92 BLEU improvement over the baseline. The tag-less approach, though better, did not achieve a large improvement over standard back-translation (only +0.16 BLEU), but after applying the improved tag-less (joint BPE) approach, as in the previous section, we achieved a large +2.96 BLEU increase in performance over standard back-translation. This is +2.80 and +7.88 BLEU over the previous tag-less approach and the baseline respectively.
For all the subsequent experiments, unless stated otherwise, we used the joint BPE technique to implement the tag-less back-translation approach as it is shown to be the most successful variant.

Tagged Vs Tag-Less Back-translation
We compared the performance of the tag-less bt model, our technique, with that of the successful tagged back-translation of [6] on the English-Vietnamese data. The synthetic sources were labelled with the <BT> token at the beginning of each sentence and mixed with the authentic sources to generate the mixed tagged parallel corpus. This mixed data is used to train the forward tagged back-translation model, tagged bt. The tagged bt model was trained until the stopping condition was met.
Finally, we trained a forward model using tagged back-translation for English-German NMT to compare its performance with our approach on this data. The tagged approach took a further 50,000 training steps to reach a single-model best of 27.49 BLEU, still underperforming the tag-less approach by 0.82 BLEU. The best model obtained after averaging checkpoints was also achieved using our approach: 28.83 BLEU compared to the tagged approach's 27.75 BLEU, an improvement of 1.08 BLEU. The performance of these models, evaluated on the test set, is shown in Table 3. On this data, the tagged approach performed better than standard back-translation by +1.88 BLEU on the average models. It can be seen in Figures 2 and 3 that, in the experiments conducted on both datasets, our tag-less approach outperformed the rest of the back-translation approaches.
This supports the hypothesis that although tagged back-translation explicitly differentiates between the two types of data using tags, a model trained with that approach may not be able to differentiate between them completely during training, as observed in the mixed performance of the models trained on the two different datasets.
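For comparison with the tag-less pipeline, the construction of the tagged training corpus described above can be sketched as follows. The function name `build_tagged_corpus` is illustrative; the <BT> token placed at the beginning of each synthetic source sentence follows the description in this section.

```python
def build_tagged_corpus(authentic, synthetic, tag="<BT>"):
    """Prefix every synthetic source sentence with `tag` and mix with authentic pairs.

    Both inputs are lists of (source, target) pairs; targets are left untouched.
    """
    tagged = [(f"{tag} {src}", tgt) for src, tgt in synthetic]
    return list(authentic) + tagged
```

In the tagged approach, a single model sees this mixed corpus in one training run, whereas the tag-less approach keeps the two sets in separate training stages.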

Fine-tuning: Synthetic Vs Authentic Data
Our technique pre-trains the forward model on the synthetic parallel data and then fine-tunes it on the authentic data. This was proposed to enable the model to unlearn the mistakes it learned from the synthetic data using the correct sentences in the authentic parallel data. We also experimented with the reverse order to investigate the effects of pre-training on the authentic data and fine-tuning on the synthetic data. We used the baseline as the pre-trained model and fine-tuned it on the synthetic data. This approach was labelled reverse tag-less bt. It did not show any benefit to the final forward model (see Fig. 3). As expected, the performance of the model decreased and the curve flattened as the number of training steps increased. The best and average scores are shown in Table 3.

Quantity of Monolingual Data
As stated earlier, it was found that as the amount of synthetic data increases, a point is reached where the performance starts to deteriorate [17]. Our work hypothesizes, instead, that the performance of the model will start to decrease only if it is not able to differentiate between the synthetic and authentic training data and, therefore, to learn efficiently from both. We also pointed out that since the data is mixed in both the standard and tagged back-translation approaches, the model may not be able to completely differentiate between the data, although in the latter approach the model is expected to treat the tagged synthetic sources as a different domain. We, therefore, experimented with different ratios of authentic to synthetic data to verify this claim. We sampled the authentic and synthetic data in the ratios 1:1, 1:3 and 1:5. The results are shown in Table 5.
In the tagged approach, the single-checkpoint best score rose when the amount of monolingual data used for back-translation was increased from the same amount as the authentic data to three times that amount. However, the performance dropped slightly when we used five times the amount of available parallel data. In contrast, the performance of the tag-less back-translation models continued to increase steadily as the proportion of monolingual data was increased. The performance improved by about 0.25 BLEU when the amount of synthetic data was tripled, and the gain doubled to about 0.5 BLEU when the synthetic data was increased further to five times the authentic data. It can also be observed that, when the same amount of synthetic and authentic data was used to train the models, the tagged approach showed very little improvement over the baseline and seriously underperformed the tag-less approach: 22.73 BLEU vs 22.22 and 26.15 BLEU respectively. Overall, we obtained improvements of 3.42, 1.78 and 1.67 BLEU on the average models using the tag-less approach over the tagged approach at the 1:1, 1:3 and 1:5 ratios respectively. It can also be observed that the model trained on the 1:1 ratio of authentic to synthetic data using our approach performs very well compared to the model trained on the same amount of data with the tagged approach, and that subsequent increases in training data lead to steady improvements, such that at the 1:5 ratio the performance had improved by about 1 BLEU. These steady improvements suggest that the model learned useful knowledge from the authentic data and used the synthetic data only for further improvements. Following this trend, we can, therefore, conclude that with more synthetic data relative to the authentic data, the model will continue to learn and increase its performance only if useful representations are learnable from the synthetic data.

Fine-tuning Standard And Tagged Back-Translations
The work of [43] reported no observable advantage from training the forward model on the authentic data and then fine-tuning it on the mixed data. We, instead, experimented with training the forward model on the mixed data first and then fine-tuning it on the authentic data. As shown in Table 6, this approach reaches the same performance as the earlier tag-less approach (25.15 BLEU) using the same amount of synthetic sentences, albeit after 30,000 more training steps, but still underperforms the joint BPE tag-less back-translation's 26.53 BLEU despite training for an additional 15,000 training steps. The better joint BPE tag-less approach converges earlier than fine-tuning the standard back-translation model, at 165,000 training steps. We also explored the use of fine-tuning to determine whether or not the tagged approach would be able to close the gap in performance with the tag-less approach. After fine-tuning the tagged bt (1:5) model for a further 35,000 training steps, the performance gain was a significant +0.93 BLEU, and only 0.17 BLEU over the average after just 20,000 steps of fine-tuning. The performance was still short of that of tag-less bt (joint BPE) (1:5) by a significant 1.36 BLEU.

Discussion
In this work, we proposed an approach for training the forward model in back-translation without using tags or noising the synthetic data. Translation models that are trained on synthetic and authentic data have been shown to perform better when they are able to differentiate between the two. Previous approaches have relied on the use of noise in back-translation [16], especially on low resource languages, to improve the performance of models. The authors believed that the approach ensures source-side diversity, which has been shown to benefit the models [24]. The approach was later found only to indicate to the forward model that the noised data is synthetic, enabling it to treat the data differently from the authentic data [6]. The use of tags has been shown to improve the performance of such models. In this work, we eliminated the need for tags and showed that, although tagging was successful at improving performance (and thus at indicating to the model that the data is synthetic and not authentic), domain adaptation methods are more capable of ensuring that the model differentiates between the data. The ability of the model to separate the two types of data is even more important in low resource languages, where the available data is not enough to train a good backward model, so that the generated synthetic data contains a lot of noise.
Domain adaptation techniques in machine translation ensure that a better model is trained by leveraging larger parallel data, either from the same language pair but in a different domain (fine-tuning) or from a different language pair (transfer learning). In this technique, the two datasets are not tagged, mixed and left for the model to differentiate. Rather, they are used at different stages of training, which ensures that the model performs as expected in the target domain. We utilized the synthetic data, which is bigger but more prone to translation noise, as the generic domain and the authentic data, which is smaller but of higher quality, as the in-domain data. This selection was made only after the reversed approach was shown not to give the desired performance. The superiority of the approach over the successful tagging approach was shown through experiments conducted on two low resource language pairs: English-Vietnamese and English-German. In each of the language pairs considered, we obtained an improvement of more than 1 BLEU point over the tagged approach, which itself outperformed the baseline and standard back-translation models.
We also tested the performance of our technique when the amount of monolingual data is increased, using different ratios of authentic parallel data to monolingual data. We found that our technique was not only able to handle the increase in synthetic data, but also attained rapid improvement with the smallest amount of synthetic data. Using the tag-less approach, we obtained a gain of 3.56 BLEU over the tagged approach when the amount of monolingual data is the same as the authentic data. The performance continued to increase steadily as the amount of monolingual data was increased. The tagged approach could only handle tripling the amount of synthetic data; its performance started to decrease when the synthetic data was increased further. Using the same amount of synthetic data in ratios 1:1, 1:3 and 1:5, our technique outperformed the tagging technique by 3.42, 1.78 and 1.67 BLEU respectively (see average scores in Table 5).
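The ratio experiments can be illustrated with a small helper that selects how many monolingual sentences to back-translate for a given authentic-to-synthetic ratio. This is a minimal sketch only: the function name `sample_monolingual`, the fixed seed, and the use of uniform random sampling are assumptions, since the text does not state how the monolingual subsets were drawn.

```python
import random
from typing import List, Sequence


def sample_monolingual(
    monolingual: Sequence[str], n_authentic: int, ratio: int, seed: int = 0
) -> List[str]:
    """Pick ratio * n_authentic sentences, giving authentic:synthetic = 1:ratio."""
    wanted = ratio * n_authentic
    if wanted > len(monolingual):
        raise ValueError("not enough monolingual sentences for this ratio")
    # Assumption: a uniform random subset; taking the first `wanted` lines
    # would be an equally plausible reading of the text.
    rng = random.Random(seed)
    return rng.sample(list(monolingual), wanted)


# Toy pool standing in for the monolingual target-side corpus.
pool = [f"sentence {i}" for i in range(50)]
subsets = {ratio: sample_monolingual(pool, 10, ratio) for ratio in (1, 3, 5)}
```

Each subset would then be back-translated and used for pre-training, with the authentic data held fixed across ratios.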
Our approach also provides the flexibility of using state-of-the-art domain adaptation methods to improve the performance of the already successful back-translation approach. Techniques such as using a different dropout and/or learning rate during pre-training and fine-tuning may improve the performance of the forward model. The method may also be applied to high-resource languages, since both settings (low and high resource) can benefit from the ability of the forward model to differentiate between synthetic and authentic data.

Conclusions and Future Work
This work has shown that an NMT model pre-trained on synthetic data and fine-tuned on authentic data outperforms the rather successful method of tagging the synthetic data in low resource NMT by enabling the forward model to differentiate between the authentic and synthetic training data. The approach, however, does not improve performance when it is reversed, that is, when the forward model is pre-trained on the authentic data and then fine-tuned on the synthetic data. As expected, the reverse approach makes the model unlearn the useful representations it has learned in favour of the noise in the synthetic data. This supports our hypothesis that, without differentiating between the two kinds of data, the synthetic data is most likely to hurt the performance of the forward model.
This work also showed that the more synthetic data is used, the better the performance of the forward model, though the most effective ratio has not yet been determined through thorough experimentation. This will form the basis of future work. We also experimented with fine-tuning the models trained using the standard and tagged back-translation approaches. Experimental results showed the standard back-translation equalling the performance of a variant of the tag-less approach after many more training steps. The performance of the tagged approach improved considerably but still trailed the tag-less approach. The most successful variant of the tag-less approach was the one that involves learning a joint BPE and building the training vocabulary on the mixture of the synthetic and authentic parallel data. This variant is possible, unlike in other fine-tuning conditions, because both the generic (synthetic) data and the in-domain (authentic) data are available during the process.
For future work, different settings for the pre-training and fine-tuning stages can be explored to maximize the benefits of the domain adaptation approach in back-translation. These include increasing or decreasing the learning rate and using dropout and L2 regularization, which may reduce overfitting on the in-domain (authentic) data, shown by [3] to be a likely problem in domain adaptation. The approach can also be investigated as a way to improve forward translation, which also relies on synthetic data for additional training data. Finally, we intend to investigate the technique in high resource languages.

Fig. 1: Tag-less Back-Translation: Training the forward model on the synthetic parallel data generated using the backward model. The forward model is then fine-tuned on the authentic data.

Algorithm 1: Tag-less Back-Translation
Input: Parallel data $D^P = \{(x^{(u)}, y^{(u)})\}_{u=1}^{U}$ and monolingual target data $Y = \{y^{(v)}\}_{v=1}^{V}$
1: procedure BACK-TRANSLATION
2:     Train backward model $M_{x \leftarrow y}$ on bilingual data $D^P$;
3:     Use $M_{x \leftarrow y}$ to create $D = \{(x^{(v)}, y^{(v)})\}_{v=1}^{V}$, for $y \in Y$;
4:     Pre-train forward model $M_{x \rightarrow y}$ on parallel data $D$;
5:     Fine-tune the forward model $M_{x \rightarrow y}$ on parallel data $D^P$;
6: end procedure
Output: forward model $M_{x \rightarrow y}$
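The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative outline rather than the authors' implementation: `tagless_back_translation`, `toy_train` and `toy_translate` are hypothetical names, and the dictionary-based "model" is only a stand-in for an NMT system so that the control flow can be run end to end.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence x, target sentence y)


def tagless_back_translation(
    authentic: List[Pair],
    monolingual_target: List[str],
    train: Callable[[List[Pair], object], object],
    translate: Callable[[object, str], str],
):
    """Back-translate, pre-train on synthetic data, fine-tune on authentic data."""
    # Step 2: train the backward model (target -> source) on the authentic bitext.
    backward = train([(y, x) for x, y in authentic], None)
    # Step 3: back-translate each monolingual target sentence into a synthetic source.
    synthetic = [(translate(backward, y), y) for y in monolingual_target]
    # Step 4: pre-train the forward model on the synthetic pairs only.
    forward = train(synthetic, None)
    # Step 5: continue training the same model on the authentic pairs only.
    forward = train(authentic, forward)
    return forward, synthetic


def toy_train(pairs: List[Pair], init):
    """Hypothetical stand-in for NMT training: memorise pairs on top of `init`."""
    model = dict(init or {})
    model.update(pairs)
    return model


def toy_translate(model, sentence: str) -> str:
    """Hypothetical stand-in for decoding: look the sentence up, else emit <unk>."""
    return model.get(sentence, "<unk>")


# Toy strings only; they are not drawn from the paper's data.
authentic = [("xin chao", "hello"), ("cam on", "thank you")]
monolingual = ["hello", "good night"]
forward, synthetic = tagless_back_translation(
    authentic, monolingual, toy_train, toy_translate
)
```

Note that no tag is added to the synthetic source sentences; the only signal separating the two kinds of data is the training stage in which each is used.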

Fig. 2: Tag-less back-translation: pre-training on synthetic data and fine-tuning on authentic data. Showing how this technique compares to the baseline and the standard back-translation approaches on the test set.

Fig. 4: Fine-tuning the baseline model on the synthetic data. Evaluation scores on the test set.

Table 1: Data Used

Table 2: Performance of the Tag-less Back-translated model compared to the baseline and standard back-translation models for Vietnamese-English and German-English translations. Evaluation scores on the test set. The tag-less approaches show results of pre-training and fine-tuning.

score of 21.19 BLEU at the 65,000th training step. We then trained a backward (En-Vi) model, backward. After the stopping condition was met, at 55,000 training steps, the best performing single checkpoint for the backward model achieved a BLEU score of 24.78 at the 50,000th training step. Averaging the last 8 checkpoints gave the best performance: 25.79 BLEU. This average model was used to back-translate the monolingual English data to generate synthetic parallel data.
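Checkpoint averaging, as used above for the backward model, takes the element-wise mean of the parameters saved at several checkpoints. A minimal sketch, assuming each checkpoint is a mapping from parameter names to flat lists of floats (real toolkits store tensors, and `average_checkpoints` is a hypothetical name):

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Params], last_n: int = 8) -> Params:
    """Element-wise mean of the parameters of the last `last_n` checkpoints."""
    selected = checkpoints[-last_n:]
    if not selected:
        raise ValueError("no checkpoints to average")
    averaged: Params = {}
    for name in selected[0]:
        # Group the i-th value of this parameter across all selected checkpoints.
        columns = zip(*(ckpt[name] for ckpt in selected))
        averaged[name] = [sum(col) / len(selected) for col in columns]
    return averaged


# Ten toy checkpoints; only the last eight (i = 2..9) are averaged.
toy_checkpoints = [{"w": [float(i), 2.0 * i]} for i in range(10)]
avg = average_checkpoints(toy_checkpoints, last_n=8)
```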

Table 3: Performance of Tag-less Back-translation compared to the Tagged back-translation model for Vietnamese-English translation. Evaluation scores on the test set.

Table 4: Performance of Tag-less Back-translation on the test set: pre-training on synthetic data and fine-tuning on authentic data vs. pre-training on authentic data and fine-tuning on synthetic data for Vietnamese-English.

125,000 steps and the training was continued up to 195,000 steps to equal the number of training steps reached by the tag-less bt model. While the tagged approach underperformed the best score of our technique by 1.78 BLEU, it outperformed the single-checkpoint standard back-translation model, 24.76 to 24.46 BLEU (+0.3 BLEU), but underperformed the average standard back-translation model by 0.23 BLEU.

Table 5: Using different ratios of the authentic to synthetic data for Vietnamese-English translation. Evaluation scores of the models on the test set.

Table 6: Before and after fine-tuning the English-Vietnamese standard and tagged back-translation NMT models on the authentic data. Evaluation scores on the test set.

Table 8: Performance of Tag-less Back-translation compared to the baseline and standard back-translation models. BLEU scores for each checkpoint of the models for German-English NMT (best single-checkpoint and average scores are shown in bold).

Table 9: Tagged vs. Tag-less Back-translation. We used the joint BPE for implementing the tag-less approach. BLEU scores for each checkpoint of the models (best single-checkpoint and average scores are shown in bold).

Table 10: Pre-training on the authentic data and fine-tuning on the synthetic data for Vietnamese-English NMT. BLEU scores for each checkpoint of the models.

Table 11: Using different ratios of authentic to synthetic parallel data and its effect on the performance of Vietnamese-English NMT. Evaluation scores (BLEU) on the test set for each checkpoint (tag-less bt colour code: BLACK for pre-train, RED for fine-tune).

Table 13: This table shows how often a conclusion with 95% statistical significance is made when comparing the various approaches. We used different sample sizes of 100, 500 and 1000 sentences for each approach on English-Vietnamese and English-German low resource NMT.
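One common way to obtain counts of this kind is paired bootstrap resampling, sketched below. This is an assumption about the procedure, since the caption does not name the test used. The sketch also compares sums of hypothetical per-sentence scores for simplicity, whereas a faithful BLEU comparison would recompute corpus-level BLEU on each resampled set; `paired_bootstrap_win_rate` is an illustrative name.

```python
import random
from typing import Sequence


def paired_bootstrap_win_rate(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    sample_size: int,
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A scores higher than B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    indices = range(len(scores_a))
    wins = 0
    for _ in range(n_resamples):
        # Draw sentence indices with replacement; the same draw is used for
        # both systems so that the comparison stays paired.
        sample = rng.choices(indices, k=sample_size)
        if sum(scores_a[i] for i in sample) > sum(scores_b[i] for i in sample):
            wins += 1
    return wins / n_resamples


# Toy per-sentence scores (hypothetical values, not results from the paper).
system_a = [0.6] * 200
system_b = [0.4] * 200
rate = paired_bootstrap_win_rate(system_a, system_b, sample_size=100)
```

Under this scheme, system A would be called better than system B at the 95% level when the returned rate is at least 0.95, and the sample sizes of 100, 500 and 1000 would correspond to the `sample_size` argument.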