This section describes the methodology underlying our model. Figure 1 is a pictorial summary of the goals of our project.
Basic Steps:
• Convert all formats into a single .txt format
– Convert scanned handwritten document PDFs into digital text
– Convert digital-text PDFs to .txt format
• Compute features that help quantify similarity/plagiarism
– Cosine Similarity
– Cosine Trigram Similarity
– Jaccard Similarity
– Longest Common Subsequence
– Sequence Matcher
– Universal Sentence Encoder
– BERT Sentence Embeddings
– Word2Vec Spacy Sentence Embeddings
– Rabin Karp
– NLTK Phrase Tool
– NLTK Docsim Tool
– N-gram Containment Tool
• Use similarity features to train models
• Compute degree of online plagiarism
• Compute degree of real plagiarism among documents
• Compute degree of semantic similarity among documents
• Recognize textual entailment among documents
DATASETS AND WORK DONE:
A. AI dataset
These are class notes compiled by two batches in the Artificial Intelligence course. There are two formats, handwritten and digital. Each student made four submissions, and on average, each submission was 2500 words long. This helped us in the analysis of semantics, as every student explained the same concept but in different words.
B. DBMS and ADA Notes
Handwritten notes were provided by the author (Figure 2). These notes were interesting because they contained a variety of formats, with different diagrams, notations, and variations in ink color, giving us the necessary variety to judge the accuracy of the Microsoft API.
C. MSRPC Dataset
This dataset [8] is provided by Microsoft and is known as Microsoft Research Paraphrase Corpus. It consists of 5800 pairs of sentences, along with a label to indicate whether a pair of sentences captures a paraphrase/semantic equivalence relationship or not.
D. SICK Dataset
It consists of 10,000 pairs of sentences along with a sentence relatedness score and a label indicating the entailment relation between the two sentences. The relatedness score is given on a 1-5 rating scale, while the entailment relation is labeled with one of three classes: “contradiction”, “entailment”, and “neutral”. This dataset [9] provides a measure of semantic similarity on a continuous scale and additionally serves as the dataset for entailment analysis.
E. STS Dataset
This dataset [10] considers the semantic similarity of nearly 8000 independent pairs of texts and uses a precise similarity metric: each pair is assigned a number between 1 and 5 denoting the level of similarity.
F. University of Sheffield Dataset
This dataset [11] is provided by the University of Sheffield; 19 students volunteered to prepare it. There were four levels of plagiarism: Near Copy, Light Revision, Heavy Revision, and Non-Plagiarism.
G. Converting Handwritten Scanned Documents into Digital Text
1) Extracting images out of PDF: The PDF was divided into smaller chunks, and each chunk was processed separately; memory was freed after each chunk to keep processing fast and avoid exhausting RAM. Images were extracted from the PDF chunks using pdf2image and the Poppler library.
2) Extracting text out of images using Microsoft API: Text is extracted from the images (obtained in the previous step) using the Microsoft API [12] and is appended to a text file. We got good results: around 70 percent of sentences were exact matches and the rest had minor mismatches, so further improvement was still needed.
3) Improvement in the extracted text: The text obtained from the API is further improved using libraries such as TextBlob [13] and Symspell [14].
• TextBlob Library :
It is based on Peter Norvig’s “How to Write a Spelling Corrector”. For an incorrect word, the code generates two candidate sets: the first by making a single edit to the word and the second by making two edits. These candidates are then narrowed down by checking their existence in the dictionary. The final replacement word is chosen using two combined probabilities: the probability of the candidate word appearing in English text, and the probability that the incorrect word would have been typed when the author intended the candidate word.
• Symspell Library : The Symmetric Delete spelling correction algorithm [6] reduces the complexity of edit-candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster than the standard approach (deletes + transposes + replaces + inserts) and language independent. Please refer to the supplementary material for a detailed description; a brief usage sketch follows.
4) Conversion of the text file to PDF: Finally, the text file is converted to a searchable PDF using the txt2pdf library.
H. Features chosen as Similarity Measures
A detailed description of each measure is provided in the supplementary material.
• BERT Sentence Encoder
We used Google’s BERT [15] to encode the two documents into vectors and then computed the cosine similarity between those vectors. This feature helps us capture the semantic similarity between the texts.
• Cosine-1
This feature gives the similarity on the basis of matching words. If u and v are the vectors of the two texts, the feature is their cosine similarity, ⟨u, v⟩ / (‖u‖ ‖v‖).
• Cosine-2
This feature also gives the similarity on the basis of matching words, as a second variant of the cosine measure.
• Cosine Trigram
This feature gives an intuition of similarity by dividing the texts into trigrams and then matching those trigrams.
• Docsim nltk
This feature helps us identify direct cases of plagiarism as a function of seam distance.
• Word2Vec Spacy Embedding
Word2Vec provides vector representations of words, “word embeddings”. However, plagiarism and similarity concern sentences and paragraphs rather than individual words, so we computed the mean vector of each sentence and compared those sentence vectors. After testing several threshold values, we found that a cosine similarity above roughly 0.85-0.90 is a reliable indication that two sentences are alike; this intuition came from observing that slightly reworded short phrases still cleared the same threshold.
• Jaccard Trigram
This feature gives an intuition of similarity by dividing the texts into trigrams and then computing the overlap of the generated trigram sets.
• Longest Common Subsequence
This feature accounts for direct cases of plagiarism by measuring the longest common subsequence shared by the two texts.
• N-gram Containment
An n-gram is a sequential grouping of n words. Containment is defined as the count of n-grams shared between a Source Text (S) and the Student Answer Text (A), divided by the count of n-grams in the Student Answer Text. If the two texts have no n-grams in common, the containment is 0; if all of the answer’s n-grams appear in the source, the containment is 1.
Here, different values of ‘n’ help us analyze the context of the inputs with respect to each other. For smaller values of ‘n’, a high containment value indicates that both inputs share the same semantics and are aligned toward the same context, whereas for higher values of ‘n’ a high containment value indicates a direct case of plagiarism. (A small sketch of the containment computation follows this feature list.)
• NLTK Phrase Toolkit
This feature is a combination of two factors: semantic similarity and word order similarity.
Similarity Score = x*Semantic Similarity + (1-x)*Word Order Similarity
Here ‘x’ represents the weight of semantic similarity, and ‘1-x’ accounts for word order similarity. We found the best results at x = 0.6, which gives a good approximation of similarity while considering both types of features.
• Rabin Karp
First, we filter the documents by tokenizing and removing stopwords. Then we calculate the hash value of each document using a rolling hash and add it to the document hash table. We obtained the best results with a window size of 3, since this value focuses on phrases rather than individual words, and the rolling hash is designed so that similar phrases map to the same values. Finally, we calculate the plagiarism rate from the hash values of both documents. This feature gives a good indication of direct plagiarism between documents. (A rolling-hash sketch follows this feature list.)
• Sequence Matcher
Sequence matcher is a tool available in the Python library ‘difflib’. It computes the longest matching character string and then a normalized similarity score from it. This feature indicates direct cases of plagiarism between documents.
• Tensorflow Universal Sentence Encoder
The Universal Sentence Encoder [16] encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. There are two Universal Sentence Encoders to choose from, with different encoder architectures targeting distinct design goals. One, based on the transformer architecture, targets high accuracy at the cost of greater model complexity and resource consumption; the other uses a deep averaging network (DAN) and targets efficient inference with somewhat reduced accuracy. We used the DAN-based model, both because we had already used BERT as a separate feature and to reduce resource consumption, since BERT is quite resource intensive.
I. Checking Plagiarism for documents
1) Real Plagiarism: We worked on this aspect because of its high importance in the real world. “Real plagiarism” refers to direct copying with no original thought. For this, we had to consider features that can contribute to such a measure. After experimenting with various datasets, we concluded that the features and weights listed in Table 1 give an accurate measure of such plagiarism. The intuition behind these values is that they emphasize the sentence-framing aspect and highlight the real cases of plagiarism, which is the primary aim of our design. (A sketch of the weighted combination follows Table 1.)
Table 1: Feature Weightage

Feature           | Weightage (in percentage)
Cosine[2]         | 5
2-gram            | 15
3-gram            | 25
3-gram            | 20
Docsim            | 5
LCS               | 10
Sequence Matcher  | 20
2) Online Plagiarism: This is an important aspect of plagiarism. Students often take help from the Internet for their assignments, tutorials, and projects. We have followed a systematic approach to identify all cases of such plagiarism. The entire methodology is summarised in Figure 4.
We were thus able to check the given input for plagiarism across the web (Internet) by first identifying possible sources and then checking each one systematically.
3) Semantic Plagiarism: We tried two different models:
• Binary Classification
– Data
We used the MSRPC [8] and University of Sheffield [11] datasets. Every text file is compared with the source file and classified as plagiarized or not. The data is divided into training and test sets, and further processing is done. We also used this model on our AI dataset.
– Model
We used a binary classification model designed using PyTorch. The inputs to the model are the similarity features we used to measure similarity among documents, and the output is a single sigmoid value that is rounded to the label 0 or 1, classifying the plagiarism: 1 indicates that the document is plagiarized, while 0 indicates it is not. We tried different combinations of nodes and hidden layers and obtained the best accuracy with the architecture shown in Figure 5. Dropout was set to 0.2 and the learning rate to 0.001. (A minimal PyTorch sketch of this model and the regression model below is given at the end of this subsection.)
– Activation Function : Rectified Linear Unit (ReLU)
– Optimizer : Adam
– Loss Function : Binary Cross-Entropy loss (BCELoss)
• Regression
– Data
Every text file is compared with the source file and given a score on a continuous scale of 1 to 5 for the similarity between the two files. We trained and tested our model on the SICK dataset. The data is divided into training and test sets in an 80:20 ratio, and further processing is done to remove inconsistencies from the dataset. The STS dataset [10] was used for validation of the model.
– Model
We used a regression model designed using PyTorch. The inputs to the model are the similarity features we use to measure similarity among documents, and the output is a single value on a scale of 1 to 5 indicating the similarity between documents. We experimented with different combinations of nodes, hidden layers, and learning rates, and obtained the best results for the architecture shown in Figure 6, with the learning rate set to 0.005.
– Activation Function : Sigmoid
– Optimizer : Stochastic Gradient Descent (SGD)
4) Text Entailment:
• Multi-Class Classification
– Data
Every text file is compared with the source file and classified into one of three classes, “Contradiction”, “Entailment”, and “Neutral”, depending on the relationship between the two files. We trained and tested our model on the SICK dataset.
– Model
We used a multiclass classification model defined using Keras. The inputs to the model are the similarity features we use to measure similarity among documents, and the output is one of the three classes “Contradiction”, “Entailment”, and “Neutral”, indicating the textual entailment between the documents. We experimented with different combinations of nodes and hidden layers; the final model is shown in Figure 7. (A minimal Keras sketch is given at the end of this section.)
– Activation Function : Rectified Linear Unit (ReLU) in the hidden layer and Softmax in the output layer
– Optimizer : Adam
– Loss Function : Categorical Cross Entropy
– Entailment Analysis Data Examples:
* Text 1 : A man with a jersey is dunking the ball at a basketball game
* Text 2 : The ball is being dunked by a man with a jersey at a basketball game
* Entailment Analysis Result : Entailment