Coherence-based automatic short answer scoring using sentence embedding

Automatic essay scoring (AES) is an essential educational application of natural language processing (NLP). Automating the process alleviates the assessment burden and increases the reliability and consistency of scoring. With advances in text embedding libraries and neural network models, AES systems have achieved good results in terms of accuracy. However, important goals remain unmet: embedding essays into vectors that preserve cohesion and coherence, and providing feedback to students, are still challenging. In this paper, we propose coherence-based embedding of an essay into vectors using sentence-BERT (Bidirectional Encoder Representations from Transformers). We trained these vectors with Long Short-Term Memory (LSTM) and Bi-LSTM (Bidirectional Long Short-Term Memory) networks to capture how each sentence connects semantically with the other sentences. We used two different datasets: the standard ASAP Kaggle dataset, and a domain-specific dataset with almost 2500 responses from 650 students. Our model performed well on both datasets, with an average QWK (Quadratic Weighted Kappa) score of 0.76. Furthermore, we achieved good results compared to other published models, and we also tested our model on adversarial responses from both datasets and observed decent outcomes.


Introduction
Automated essay scoring (AES) evaluates student responses written for a prompt, reducing human effort and ensuring consistency in scoring. Many researchers have worked on AES systems in the recent past. Following Ramesh and Sanampudi (2022), and as summarized in Table 1, these approaches can be categorized into four classes based on the combination of manually and automatically extracted features and the machine learning or neural network models used to train on the essays. Early systems such as Page (1966), Ajay et al. (1973), Burstein et al., and Cummins et al. (2016) used manually extracted features such as bag of words (BoW), term frequency, inverse document frequency, word count, sentence count, and sentence length. They used machine learning models such as regression and support vector machines to find the relationship between essays and labels from these manually extracted statistical features. Nevertheless, these approaches did not capture the semantics and content of an essay during evaluation.

Table 1. Comparison of machine learning and neural network models with feature extraction methods

Regression/classification models:
- BoW/Tf-Idf: the system will have low cohesion and coherence.
- Word2vec/Glove (word embedding): the system will have low to medium cohesion and coherence.
- USE (sentence embedding): sentence-based encoding captures coherence within each sentence, but with regression models the sentence-to-sentence connectivity is missed.

Neural networks (LSTM):
- BoW/Tf-Idf: the system will have low cohesion and coherence.
- Word2vec/Glove (word embedding): the system will have medium to high cohesion and coherence.
- USE (sentence embedding): sentence-based embedding with an LSTM captures cohesion and coherence, but USE embeds each sentence into 512 dimensions.
Contribution.
This paper concentrates on extracting features from an essay without losing coherence and cohesion. We used a recurrent neural network to capture sentence connectivity and produce the final score.
We developed an AES system based on sentence-level embedding to capture sequential features, which fine-tunes the relevance and semantics of individual essays to produce the final score.
We evaluated our system on two datasets to demonstrate our model's robustness. One is the standard dataset, and the other is a new, publicly available dataset on the operating systems domain created by us. Our approach outperforms existing AES-based approaches.
We demonstrated the effectiveness of our approach through experimental evaluation on various adversarial responses, on which it significantly outperformed the baselines. Organization.
The rest of the paper is organized as follows. Section 2 discusses related work on text embeddings and deep learning models used for AES systems and their challenges. Section 3 presents the proposed AES method on different datasets and sentence embeddings. Section 4 discusses the implemented models, their architectures, and the hyperparameters used during training. Section 5 discusses the experimental results, compares them with other models, and presents the model's performance on adversarial responses. Finally, Section 6 presents the conclusion and future work.

Related Work
Table 1 illustrates the possible combinations of text embeddings and machine and deep learning models used for essay scoring. Table 2 shows each text embedding technique and the vector dimension it creates; among them, sentence-BERT gives a low-dimensional vector per sentence. The best combination is sentence embedding with a recurrent neural network, because sentence embedding captures coherence from an essay and can easily handle polysemous words, which was a deficiency of word embeddings.

Method
We propose an approach that uses sentence-based text embedding to capture coherence, training an LSTM and a Bi-LSTM separately. We used two datasets: one is the standard ASAP dataset; the other is a domain-specific dataset collected as an assignment from 600 students, a total of 2300 responses. Moreover, we tested our model on different types of adversarial responses to check the system's robustness. The detailed architecture of the proposed AES system with sentence-BERT (Devlin et al., 2019) and Bi-LSTM is illustrated in Fig. 1.

Dataset
We used the ASAP Kaggle dataset, which is widely used in AES systems. The ASAP dataset comprises 12,978 essays from 8th- to 10th-grade students on eight different prompts. Each prompt has 1500 or more essays, each evaluated by two raters. Prompts 3, 4, 5, and 6 are source-dependent essays; the remaining prompts are of other types. A detailed description of the essay dataset is presented in Table 3. In addition, we created a new dataset in the operating systems (OS) domain to test the performance of AES systems on domain-specific essays. We framed five basic questions from operating systems, a computer science subject, as an assignment and distributed them to students of various engineering colleges. We received 2981 responses; after eliminating repeated or multiple submissions, we were left with 2390 responses from 626 students. Two subject experts evaluated the new dataset on a 0-5 scale, where the minimum score is 0 and the maximum is 5. We used the QWK score to measure the agreement between the two raters; the resulting score is 0.842 (QWK). A detailed description of the OS dataset is given in Table 4.
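The inter-rater agreement above can be computed as a quadratic weighted kappa. A minimal NumPy sketch (the 0-5 scale matches the OS dataset; the example ratings are illustrative):

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_rating=0, max_rating=5):
    """QWK between two integer rating vectors on [min_rating, max_rating]."""
    a = np.asarray(a) - min_rating
    b = np.asarray(b) - min_rating
    n = max_rating - min_rating + 1
    # Observed agreement matrix O and expected matrix E from the marginals.
    O = np.zeros((n, n))
    for i, j in zip(a, b):
        O[i, j] += 1
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(a)
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance.
    idx = np.arange(n)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

rater1 = [5, 4, 3, 5, 2, 0]
rater2 = [5, 4, 4, 5, 2, 1]
print(round(quadratic_weighted_kappa(rater1, rater2), 3))
```

Identical ratings yield a kappa of exactly 1; chance-level agreement yields 0.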

Model
An essay is a sequence of sentences that jointly address the prompt. The implemented model should therefore extract coherence, cohesion, and linguistic features from an essay to produce the final score. Accordingly, we embedded every essay sentence-wise, without removing stop words, to capture coherence. Then we trained LSTM and Bi-LSTM models separately on the sentence vectors. We also implemented a CNN + LSTM model, as in Taghipour & Ng (2016), which captures N-gram features and then feeds them into an LSTM unit.
LSTM/Bi-LSTM. LSTM and Bi-LSTM are recurrent neural networks that process a sequence of information using a memory cell. The memory cell consists of a forget gate (1), an input gate (2), an output gate (3), and a context (cell) gate (4), which decide what long-term dependency information to store and what to pass to the next cell. In their standard form, with x_t the t-th sentence vector and h_{t-1} the previous hidden state:

f_t = σ(W_f [h_{t-1}, x_t] + b_f)   (1)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)   (2)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)   (3)
c̃_t = tanh(W_c [h_{t-1}, x_t] + b_c),  c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,  h_t = o_t ⊙ tanh(c_t)   (4)

Bi-LSTM traverses the sentence vectors in both directions to capture coherence. Context information from each sentence is stored and combined with the following sentence, so that the summary accumulated over all sentences predicts the final score for the essay. Here the W's are weight matrices, the b's are biases, σ is the sigmoid activation function, ⊙ is element-wise multiplication, h_t is the output, and t ranges over the sentences in the essay.
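A minimal NumPy sketch of one LSTM cell step following gates (1)-(4); the weight shapes, random initialization, and hidden size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev, x_t] to the four stacked gates."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:d])           # forget gate (1)
    i = sigmoid(z[d:2*d])         # input gate (2)
    o = sigmoid(z[2*d:3*d])       # output gate (3)
    c_tilde = np.tanh(z[3*d:])    # candidate context (4)
    c = f * c_prev + i * c_tilde  # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c

hidden, embed = 64, 128           # hidden size; sentence-vector dimension
W = rng.normal(0, 0.1, (4 * hidden, hidden + embed))
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
for x_t in rng.normal(size=(23, embed)):   # 23 sentence vectors, as in the OS dataset
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape)
```

A Bi-LSTM simply runs a second such pass over the sentence vectors in reverse order and concatenates the two hidden states.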

Implementation and Training of RNNs
First, we embedded all the essays into vectors using sentence-BERT, then padded every essay to the full essay size, i.e., 96*128, as in Fig. 2. Then we arranged all the vectors into a 3-dimensional tensor to train the neural network.
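The embedding-and-padding step can be sketched as follows; the stub encoder stands in for sentence-BERT, and the 96-sentence, 128-dimension shape matches the ASAP setup above:

```python
import numpy as np

MAX_SENTS, DIM = 96, 128   # ASAP: up to 96 sentences, 128-dim sentence vectors

def encode_sentences(sentences):
    """Stub for sentence-BERT: one 128-dim vector per sentence (random here)."""
    rng = np.random.default_rng(abs(hash(tuple(sentences))) % (2**32))
    return rng.normal(size=(len(sentences), DIM))

def pad_essay(sent_vecs):
    """Zero-pad (or truncate) an essay to a fixed MAX_SENTS x DIM matrix."""
    out = np.zeros((MAX_SENTS, DIM))
    n = min(len(sent_vecs), MAX_SENTS)
    out[:n] = sent_vecs[:n]
    return out

essays = [["First sentence.", "Second sentence."], ["Only one sentence."]]
X = np.stack([pad_essay(encode_sentences(e)) for e in essays])
print(X.shape)   # 3-D tensor: (num_essays, 96, 128)
```

The resulting tensor is what the LSTM/Bi-LSTM consumes, one padded sentence sequence per essay.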
We used 5-fold cross-validation; in each fold, we calculated the QWK score. Finally, we used the model that achieved the best performance on the training data to predict the test data. Figure 3 illustrates the training and validation loss of the proposed model; it shows that our proposed model is neither overfitted nor underfitted.
We used the same hyperparameters and 5-fold cross-validation to train sentence-LSTM and sentence-Bi-LSTM on the OS dataset. The OS dataset input dimension is 23*128 for both LSTM and Bi-LSTM, where 23 is the maximum number of sentences and 128 is the sentence-vector dimension.
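The 5-fold protocol can be sketched as below; the scoring model here is a placeholder mean predictor, not the paper's LSTM, and the data are random stand-ins for the padded OS essays:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 23, 128))   # 100 padded OS-style essay tensors
y = rng.integers(0, 6, size=100)      # scores on the 0-5 scale

fold_preds = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    # Placeholder "model": predict the training-set mean score for every essay.
    # In the paper, an LSTM/Bi-LSTM is trained here and QWK computed per fold.
    mean_score = y[train_idx].mean()
    fold_preds.append(np.full(len(val_idx), round(mean_score)))

print(sum(len(p) for p in fold_preds))   # every essay is validated exactly once
```

Each essay lands in exactly one validation fold, so per-fold QWK scores can be averaged without double counting.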

Result Analysis
The experimental results obtained on the ASAP and OS datasets for the AES system are shown in Figs. 3 and 4; we observed that our proposed model fits both datasets well. We trained the model prompt-wise and calculated the training and validation loss prompt-wise.
The results show that our proposed models outperformed or performed on par with other models. Table 7 compares all baseline models on the ASAP dataset with our proposed models by average QWK score. We found that the sentence embedding-LSTM and sentence embedding-Bi-LSTM models performed well compared to other models and were consistent with the human rater scores. Furthermore, sentence embedding-LSTM and Bi-LSTM performed better than models such as Muangkammuen and Fukumoto (2020) and Agrawal and Agrawal (EASE) (2018); other models performed on par with sentence embedding-LSTM and Bi-LSTM. However, these integrated and word-embedding models did not capture sentence coherence; their QWK scores were high because of the neural networks. Like other models, ours performed consistently on the source-dependent prompts (3, 4, 5, and 6), whose rating range is 1 to 6. Furthermore, by QWK score, the performance of our models and the other baseline models on persuasive, narrative, and expository essay traits is comparable or slightly higher. However, performance dropped when we added a CNN layer to the sentence embedding LSTM/Bi-LSTM models; from this we conclude that when essays or sentences are split into tokens, coherence and cohesion are lost. The models trained on the ASAP dataset were also applied to the OS dataset with the same hyperparameters, achieving average QWK scores of 0.746 (Bi-LSTM) and 0.751 (LSTM).

Testing on Adversarial responses
To understand how the models perform on adversarial responses, and whether essays are evaluated based on semantics, we prepared eight adversarial test cases, covering irrelevant responses, relevant responses, the prompt submitted as a response, and repeated sentences. Tables 9 and 10 compare the actual and predicted scores of the word embedding (word2vec) and sentence embedding models on all eight test cases. On test cases 1, 2, and 3, the difference between the actual and predicted scores of the word embedding model is high. The sentence embedding model, on the other hand, performed well.
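Adversarial test cases of the kinds listed above can be generated mechanically. A sketch (the prompt and response texts are illustrative, not taken from the datasets):

```python
# Build simple adversarial variants of a student response for robustness testing.
prompt = "Explain how an operating system handles multiple tasks at a time."
response = ("The OS schedules tasks on the CPU using scheduling algorithms. "
            "Scheduling may be preemptive or non-preemptive.")

adversarial_cases = {
    "prompt_as_response": prompt,                     # copy the prompt verbatim
    "repeated_sentences": " ".join([response] * 3),   # duplicate the content
    "one_sentence": response.split(". ")[0] + ".",    # truncated answer
    "irrelevant": "A system is a way of working according to a fixed plan.",
}

for name, text in adversarial_cases.items():
    print(name, len(text.split()))
```

A robust scorer should not reward the repeated-sentence variant above its single-copy score, nor give content credit to the prompt-as-response or irrelevant variants.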
In test cases 5 and 6, the word embedding model underperformed when we tested irrelevant responses containing a few words that matched the content. In contrast, on the repeated-sentence responses of test cases 5 and 6, the sentence embedding model captures the semantics of the essay and does not reward duplicate sentences when producing the final score. For example, in Table 10, test cases 1 and 3 are irrelevant responses, test case 2 is a one-sentence response, test case 4 is 50% relevant and otherwise irrelevant, and in test case 5 the sentences are repeated. In all these cases, our model performed well.
We observed that the word embedding model underperformed and does not capture essay semantics. In contrast, sentence embedding with LSTM performed well on irrelevant responses, giving a final score based on relevance.
We strongly argue that our model captures coherence and cohesion from essays while evaluating and assigning a score. We used sentence-wise embedding and trained neural networks to capture sequence-to-sequence patterns from the essay. Our model's average QWK score is greater than or equal to that of other baseline models, and in addition it assesses essays based on coherence and cohesion. Moreover, our model's performance is consistent in semantics and relevance when tested on adversarial responses. An example repeated-sentence test case from the OS dataset, for the prompt "Explain how the operating system handles multiple tasks at a time?" (actual and predicted scores 4 and 5.1), consists of the following passage given twice: "The OS assigns a number of tasks to a CPU and performs all the tasks one by one using various kinds of priorities. The way of deciding the priorities is the scheduling algorithms. There are 2 types of scheduling algorithms: preemptive and non-preemptive. And there are 4 ways of deciding the priorities: first come first served, shortest job first, priority scheduling, and round robin. Based on the requirements, the type of scheduling is allotted and the processes get executed."

Conclusion
In this paper, we proposed and implemented a novel approach to AES that combines sentence embedding with an LSTM and a Bi-LSTM. All models were trained and tested on the Kaggle ASAP and OS datasets. In this approach, we embedded each essay sentence by sentence after preprocessing, to capture sequence-to-sequence coherence patterns, and trained recurrent neural networks on the result. Specifically, we compared our sentence embedding-LSTM and sentence embedding-Bi-LSTM models with baseline AES models and observed that they performed well among all models. Moreover, our proposed models outperformed other baseline models without losing coherence. Some baseline models score higher in terms of QWK, but so far all of them have used word-based embedding.

Figures

Figure 1. Architecture of the AES system
Figure 2
Later categories of approaches extracted features manually or automatically and trained different types of neural network models. Taghipour & Ng (2016), Dong & Zhang (2016), Riordan et al. (2017), Mathias & Bhattacharyya (2018), Wang et al. (2018), Dasgupta et al. (2018), Kumar et al. (2019), and Zhu and Sun (2020) used pre-trained natural language processing models such as word2vec and GloVe to extract features. They then used neural networks, namely CNNs (convolutional neural networks), RNNs (recurrent neural networks), and combinations of the two, to fine-tune on the essays. These approaches performed well in terms of QWK score. However, word-level feature extraction cannot handle polysemous words, and it misses sentence semantics and connectivity within an essay. Moreover, no AES model had demonstrated robustness by testing adversarial responses for consistency. Horbach and Zesch (2019), Riordan et al. (2019), Ding et al. (2020), and Kumar et al. (2020) showed that black-box models are prone to adversarial responses, and that handling irrelevant or repeated-sentence responses is a challenging task. So, we need an AES system that handles adversarial responses and evaluates essays based on content.
In the second category of approaches, Sultan et al. (2016), Contreras et al. (2018), Darwish and Mohamed (2020), and Süzen et al. (2020) extracted content-based features with word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). However, they again used machine learning models for assessment. Such models consider the feature vectors independently and do not connect the words to capture the semantics of the essay.
Mathias & Bhattacharyya (2018) added an extra CNN layer on top of the LSTM or Bi-LSTM to form sentences, with the CNN layer grouping a fixed number of words into each sentence. With this approach the model's accuracy increased, but the actual sentence boundaries diverged and real word connectivity was lost; it still does not handle irrelevant responses such as one-word or random-word responses. Kumar et al. (2019) implemented a domain-based AES system based on word embedding. More recently, researchers have embedded essays directly sentence by sentence using USE (Universal Sentence Encoder) and sentence-BERT, e.g., Rodriguez et al. (2019), Lun et al. (2020), Yang et al. (2020), Song et al. (2020), Ormerod et al. (2021), and Doewes and Pechenizkiy (2021). Mayfield and Black (2020) used BERT for essay embedding, but at the word level instead of the sentence level, and trained an LSTM model for fine-tuning. USE and sentence-BERT capture coherence and cohesion from an essay, and the LSTM model tunes sentence connectivity and coherence to assign a score. Fernandez et al. (2022) used BERT for reading comprehension, trained separately on prompt, text, and student response for assessment. Wang et al. (2022) used BERT and LSTM.

Table 4

In natural language processing, converting text into a vector with context and semantics is challenging. Embedding techniques such as word2vec and GloVe convert text into vectors word by word, but they do not consider surrounding words and the resulting vectors are static; in particular, they cannot handle polysemous words. There are also sentence embedding techniques that convert word vectors into a sentence vector by averaging all the words. So, in our model, we used sentence-BERT to convert essays into vectors. Sentence-BERT converts text into vectors dynamically, with context and semantics, so that the meaning of the original sentence is preserved in the vector. First, we removed all special symbols such as @ and # from each essay and tokenized it into sentences. The maximum numbers of sentences per essay are 96 and 23 in the ASAP and OS datasets, respectively. Then we used a pre-trained transformer model, sentence-BERT, to embed each sentence into a 128-dimensional vector, giving 96*128 and 23*128 dimensional representations for essays in the ASAP and OS datasets. Finally, we padded all essays to 96*128 and 23*128 vectors, where 96 and 23 are the maximum numbers of sentences in the ASAP and OS datasets. Table 5 illustrates the sentence vectors of an essay from both datasets.
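The cleaning-and-tokenizing step described above can be sketched as follows; the regex-based sentence splitter is a simplification standing in for whatever tokenizer was actually used:

```python
import re

def preprocess(essay):
    """Strip special symbols like @ and #, then split the text into sentences."""
    cleaned = re.sub(r"[@#]", "", essay)
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", cleaned) if s.strip()]
    return sentences

essay = "The OS schedules tasks. It uses priorities! Is that preemptive? #os @ta"
print(preprocess(essay))
```

Each returned sentence would then be passed to the sentence encoder, and the resulting list of vectors padded to the dataset's maximum sentence count.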
Our model uses layers of LSTM units; each unit has an input gate, an output gate, and a context gate. As in Dong et al. (2017), we used the RMSprop optimizer to reduce the mean squared error, a dropout rate of 0.5, an initial learning rate of 0.001, and the ReLU activation function. The hyperparameters of our models are shown in Table 6.
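For reference, the RMSprop update used above has the following standard form; this NumPy sketch minimizes a toy quadratic loss, with the learning rate 0.001 matching the paper and the decay 0.9 being the usual default:

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSprop update: scale the gradient by a running RMS of past gradients."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Minimize the toy loss L(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
for _ in range(2000):
    w, cache = rmsprop_step(w, 2 * w, cache)
print(np.abs(w).max())
```

The per-parameter scaling makes step sizes roughly uniform across parameters, which is why RMSprop is a common default for training recurrent networks.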

Table 7. Prompt-wise QWK score with LSTM and Bi-LSTM on the ASAP dataset

Table 8 illustrates the prompt-wise QWK score and average QWK score of the word embedding, sentence embedding-LSTM, and sentence embedding-Bi-LSTM models on the OS dataset.

Table 9. Testing and comparing results of the proposed model and the word embedding model on adversarial responses
(Sample responses from the table include an irrelevant, dictionary-style definition of "system" as a way of working according to a fixed plan or set of rules, and a partially relevant response about how the operating system determines which task the CPU will work on at any given time.)