Section 3 responds to three of the research questions (RQ1, RQ2, RQ3) defined in Section 2. The section is divided into three segments. The first segment discusses, in chronological order, the various approaches used for resume parsing and answers these three RQs. In line with the inclusion criteria set in Section 2, the second segment is dedicated to algorithms and frameworks that improve the backend parsing process. Similarly, the third segment sheds light on some auxiliary techniques to upgrade parsing.
3.1 Approaches to Resume Parsing in chronological order
Information Extraction (IE) from resumes with high recall and precision is a difficult task because resumes are heterogeneous: they differ in format, file extension and writing style. One suitable method to extract information from resumes was proposed by (Yu, K., Guan, G. & Zhou, M., 2005), which constitutes a two-pass Hybrid Cascaded Model (HCM). In the first pass, a Hidden Markov Model (HMM) is applied to extract general information by segmenting the resume into several blocks and annotating each block with a label.
Since an HMM is a state-based model, it makes extracting information fields that follow a strong sequential order much easier. In the second pass, detailed IE is carried out within the boundary of each block using a Support Vector Machine (SVM) classifier, because classification is viable when the pieces of information are independent of one another, as is the case within the segmented blocks. The exception is the “Education” block: since most resumes follow the same sequence in their Education section, an HMM performs better there. SVM is used as the classifier due to its robustness to overfitting and its high performance (Sebastiani, 2002). Good-Turing smoothing is applied in the HMM for parameter estimation, and the back-off scheme given by Katz in 1987 is used for probability estimation, as sparse training data could be a big hurdle (Gale, 1995; Katz, 1987). A “One vs All” multi-class classification strategy is applied in the SVM model, as SVM is a binary classifier by nature. Block selection for the second pass is done using a fuzzy selection strategy to avoid non-boundary blocks being labelled as boundaries and to enlarge the search scope. Since HCM is a pipeline framework, the chances of error propagation from one pass to the next are very high. (Kopparapu, S. K, 2010) proposed a one-pass alternative in which six major segments of a resume, as described by the HR-XML Consortium, are extracted. IE is done using an N-gram NLP model, and a combination of heuristic rules is applied for extracting other segments. One of the major issues faced while doing IE with a predefined knowledge base is the creation of the knowledge base itself, which is time-consuming if done manually. A modification of the BASILISK algorithm (Thelen et al, 2002) proposed by (Pawar et al, 2012), which is unsupervised in nature, works well for domain-independent Named Entity Extraction (NEX) for automated gazette (knowledge base) creation. 
One such modification is the addition of negative features to resolve ambiguity. For example, depending upon the domain, “Role” and “Designation” may or may not have the same meaning, which introduces ambiguity into the algorithm. A custom algorithm is also introduced for Named Entity Recognition (NER), where n-level indexes are prepared for the most important words of each named entity to leverage their importance. These indexes are further used to map the occurrences with the gazette. This approach is better and more accurate than naïve regular expression (regex) based methods such as (Kopparapu, S. K, 2010).
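The first-pass HMM segmentation of the cascaded model described above can be illustrated with a small Viterbi decoder. This is a minimal sketch, not the authors' implementation: the three block labels, the coarse per-line observation symbols ("contact", "degree", "job") and all probabilities are hypothetical values chosen for illustration, and no Good-Turing smoothing or back-off is applied.

```python
import math

# Hypothetical block labels and hand-picked probabilities (illustration only).
STATES = ["Personal", "Education", "Experience"]
START_P = {"Personal": 0.8, "Education": 0.1, "Experience": 0.1}
TRANS_P = {
    "Personal": {"Personal": 0.5, "Education": 0.4, "Experience": 0.1},
    "Education": {"Personal": 0.05, "Education": 0.55, "Experience": 0.4},
    "Experience": {"Personal": 0.05, "Education": 0.15, "Experience": 0.8},
}
EMIT_P = {
    "Personal": {"contact": 0.8, "degree": 0.1, "job": 0.1},
    "Education": {"contact": 0.1, "degree": 0.7, "job": 0.2},
    "Experience": {"contact": 0.1, "degree": 0.2, "job": 0.7},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely label sequence for a list of observation symbols."""
    # best[t][s] = (log-probability of the best path ending in s at step t,
    #               state at step t-1 on that path)
    first = observations[0]
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# One coarse symbol per resume line (hypothetical feature extraction).
LINES = ["contact", "contact", "degree", "degree", "job", "job"]
LABELS = viterbi(LINES, STATES, START_P, TRANS_P, EMIT_P)
```

Consecutive lines that receive the same label form one block, which is the unit the second pass would then process.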
“Semantic Web” is a term coined by Tim Berners-Lee and is an extension of the World Wide Web (WWW). It is a set of standards that makes internet data machine-readable. Ontology is one of the pillars of the Semantic Web standards, and some of the ontology languages are the Web Ontology Language (OWL) and the Resource Description Framework (RDF). (Celik et al, 2012) incorporates Ontology Knowledge Bases (OKBs) to store various domain ontologies for each type of resume segment. These ontologies contain the <Literal xml:lang> tag to deal with resumes in multiple languages. EXPERT uses ontology mapping to map resumes with job criteria using a custom mapping and similarity function (Kumaran et al, 2013). Another approach performing ontology mapping stores the resume in an RDF graph database using a NoSQL data model and uses SPARQL for querying the database (Abirami et al, 2014; Bojārs et al, 2007). For implementing an ontology-based parser in J2EE, the OWL API by Apache Jena can be utilized (Mohamed et al, 2018). Although still valid, the concept of the Semantic Web has received less attention since the rise of Artificial Intelligence (AI).
Many recruiting websites and software products have long provided, and still provide, an online form for basic details such as personal information, education details, certifications and work experience. This metadata can be analysed automatically, but most candidates skip this part and directly upload their resumes, which makes screening of resumes necessary. PROSPECT proposes a well-defined architecture for screening resumes (Singh, A., Rose, C., Visweswariah, K., Chenthamarakshan, V., & Kambhatla, N., 2010). The strengths of this system include the use of shingling (Broder, A. Z., 2000) for detecting identical/plagiarised resumes, linear-chain Conditional Random Fields (CRFs) for IE and mining of named entities, and the TabClass and ColClass SVM classifiers for separating multiple tables in a resume and identifying the columns in each table, respectively. CRFs perform better than HMMs in a linear-chain model as they act as conditionally trained HMMs and avoid the label bias problem (Lafferty et al, 2001). However, the ColClass classifier is not applicable to 1-D tables that have only rows. Also, data normalization is handled using string-matching techniques, which is quite rudimentary. The initial data annotation of the training set is conducted manually, and no automated alternative is defined. Segmentation and attribute extraction are done using multiple feature types such as lexicon features and visual features, yet IE is still not satisfactory: only tabular data is properly extracted. Scoring of resumes is done using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. Another framework, proposed by (Farkas et al, 2014), also uses multiple feature types, but improves on this by using both manual and automatic annotation with a self-developed custom tool. A two-level annotation scheme is considered for IE due to the presence of complex data structures. 
Maximum-Entropy Markov Model (MEMM) and CRFs are applied for IE, and MEMM is finally adopted due to its substantially lower runtime. The process of data annotation can be aided using the Dataturks tool, and the spaCy library for Python can be utilized to perform various NLP tasks such as tokenization, Part-Of-Speech (POS) tagging and Named Entity Recognition (NER) (Satheesh et al, 2020). A two-step extraction framework is reported to work better than PROSPECT (Chen, J. et al, 2015): it uses adaptive segmentation with a basic classifier as an intermediary step to obtain semi-structured data with Simple, KeyValue and Complex tags, followed by further IE using the Naïve Bayes algorithm. A standardized text-window based approach is adopted by (Tikhonova et al, 2019), in which word embeddings, along with their TF-IDF weights, are summed to generate a text field embedding that then acts as input to CatBoost classifiers for segmentation of resumes. Word embeddings can be constructed using algorithms such as FastText (Joulin et al, 2016; Bojanowski et al, 2017), Word2Vec (Mikolov et al, 2013; Mikolov, T., Yih, W. T., & Zweig, G., 2013) and GloVe (Pennington et al, 2014).
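The text field embedding mentioned above can be sketched as a TF-IDF weighted sum of word vectors. This is one plausible reading of the description, not the cited authors' exact formulation: the function names, the "+ 1" in the IDF term, the two-dimensional toy vectors and the tiny corpus are all assumptions made for illustration. In practice the vectors would come from a trained FastText, Word2Vec or GloVe model.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Inverse document frequency per token; the '+ 1.0' is an assumed smoothing choice."""
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def field_embedding(tokens, embeddings, idf, dim):
    """Sum the word vectors of a text field, each scaled by its TF-IDF weight."""
    vector = [0.0] * dim
    term_freq = Counter(tokens)
    for tok, count in term_freq.items():
        if tok not in embeddings:
            continue  # out-of-vocabulary tokens are skipped in this sketch
        weight = (count / len(tokens)) * idf.get(tok, 0.0)
        for i, value in enumerate(embeddings[tok]):
            vector[i] += weight * value
    return vector


# Hypothetical corpus of tokenized text fields and toy 2-D word vectors.
CORPUS = [["python", "developer"], ["java", "developer"]]
TOY_VECTORS = {"python": [1.0, 0.0], "developer": [0.0, 1.0], "java": [1.0, 1.0]}
IDF = idf_weights(CORPUS)
FIELD_VECTOR = field_embedding(["python", "developer"], TOY_VECTORS, IDF, dim=2)
```

The resulting fixed-length vector is what a downstream classifier would consume; "python" contributes more than "developer" here because it appears in fewer fields of the toy corpus.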
Another approach uses the concept of specialness/uniqueness to extract special skills from resumes (Maheshwari et al, 2010). This is based on the concept of product selection from e-commerce websites using special features (Maheshwari et al, 2009). It is assumed that the skill segment of a resume follows a two-layer structure, i.e., skill type and skill value. Initial preprocessing is done on the text documents (resumes) using several rules implemented either manually or by using a combination of lists and hash tables. Then feature sets are identified for the skill type and value using a custom algorithm, with each feature represented by a tuple. A “Degree of Specialness (DS)” criterion is defined to score each feature from 0 to 1. Features with DS = 0 and DS close to 1 are categorized as common and special features, respectively. The rest of the tuples are categorized as common cluster features. Finally, a three-level feature organization is built using a clustering algorithm, with the third level holding the special features. A similar approach also extracts skills, rather than special features, from the resume, but uses a skill ontology with more than 13,000 concepts to match skills (Chifu et al, 2017). The key takeaways from this article are the two algorithms proposed: one generates lexicalized contexts preceding a known skill, using manually identified POS patterns, to identify new skills unknown to the ontology; the other uses Wikipedia to suggest taxonomic parents for the newly detected skills so that they can be stored in the ontology. Similar work is done by (Chandola et al, 2015), where a weighted knowledge base of skills (nouns, verbs and adjectives) is compared with the words extracted from the resumes after applying a POS tagger and a chunker on top of it. The resumes are scored by adding up the weights of the matching skills. Finally, a categorization similar to sentiment analysis is performed based on the score to segment the resumes on a priority basis.
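The weighted scoring and priority categorization described in the last approach above can be sketched as follows. This is a simplified illustration under stated assumptions: the skill weights, the priority thresholds and the choice to count each matching skill once are hypothetical, and the POS tagging and chunking steps are assumed to have already produced the token list.

```python
def score_resume(tokens, skill_weights):
    """Add up the weights of the distinct knowledge-base skills found in a resume."""
    found = {tok.lower() for tok in tokens}  # assumption: each skill counts once
    return sum(skill_weights.get(tok, 0.0) for tok in found)


def priority(score, thresholds=(2.0, 4.0)):
    """Map a score to a priority bucket; the thresholds are hypothetical."""
    medium, high = thresholds
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# Hypothetical weighted knowledge base of skills.
SKILL_WEIGHTS = {"python": 3.0, "sql": 2.0, "communicate": 1.0}
RESUME_TOKENS = ["Python", "SQL", "python", "gardening"]
RESUME_SCORE = score_resume(RESUME_TOKENS, SKILL_WEIGHTS)
RESUME_PRIORITY = priority(RESUME_SCORE)
```

Whether repeated mentions of a skill should raise the score is a design decision the description leaves open; this sketch ignores repeats so that keyword stuffing does not inflate a resume's priority.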
Table 2. Different approaches to Resume Parsing in chronological order
Deep Learning (DL) is a sub-domain of Machine Learning (ML) which uses Artificial Neural Networks (ANNs) consisting of nodes that mimic biological neurons. The depth of the ANNs enables them to differentiate between different segments/classes if given sufficient data. Also, since what they learn from the input data is stored in the network itself, a loss of data from the source repository does not affect the performance of a trained model. (Pham Van et al, 2018) applies a Deep Neural Network (DNN) for IE from resumes. In this framework, initial segmentation is performed by matching the resumes against a data dictionary of common resume headings. Dedicated rule-based chunkers for each segment are applied for NER. In order to find more named entities, a DNN composed of CNN, Bi-LSTM and CRF layers, from bottom to top, is used. The Convolutional Neural Network (CNN) generates the word representations, as CNNs are efficient in encoding morphological information extracted from characters into neural representations (Dos Santos et al, 2014; Chiu et al, 2016). Recurrent Neural Networks (RNNs) can capture time dynamics via cycles in the graph. Hence, they should be able to capture long-range dependencies, but in practice they fail due to vanishing/exploding gradients (Pascanu, R., Mikolov, T., & Bengio, Y., 2012). The Long Short-Term Memory (LSTM) model is an upgrade that can manage these issues. A Bi-directional LSTM (Bi-LSTM) model is used because access to both past and future contexts aids sequence labelling. Finally, the output vectors of the Bi-LSTM layers are passed to a CRF layer, which learns the correlations between output labels, to generate the best possible sequence of labels. Stochastic Gradient Descent (SGD) is used for parameter optimization, as it is reported to perform better in this setting than other optimization algorithms such as RMSProp, AdaDelta and Adam. Early stopping can be implemented based on the performance on the validation set at each epoch (Caruana et al, 2001). 
Dropout can be applied on both the input and output vectors of the Bi-LSTM layer to mitigate overfitting (Srivastava et al, 2014). Another ANN-based approach to IE uses a CNN model with GloVe word embeddings for segmentation and CRFs, implemented with CRF++, for sequence labelling. Bi-LSTM and Bi-LSTM-CNN models are used as baselines for comparison on segmentation and sequence labelling, respectively. The CNN performs better because of the pre-trained GloVe model, and since CRFs are undirected in nature, they perform much better than the Bi-LSTM-CNN model because of their ability to access both past and future contexts (Ayishathahira et al, 2018). Job matching between job descriptions (JDs) and resumes can be achieved using just CNNs through a deep Siamese network (Maheshwary et al, 2018). This approach uses a pair of CNNs with repeated convolution, max-pooling and leaky rectified linear unit layers, topped by a fully connected layer. It helps to accurately capture the underlying semantics by projecting similar resumes and JDs closer together in the semantic space and pushing dissimilar ones apart. LSTMs can also perform the same task, but at a much higher computational cost. Parameter sharing in the Siamese network reduces computation time. The Doc2Vec model is used to generate the document embeddings that serve as input to the network.
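The idea of pulling matching resume-JD pairs together and pushing mismatched pairs apart can be illustrated with a contrastive loss over a shared encoder. This is a toy sketch, not the cited architecture: a single shared linear map stands in for the twin CNNs, the contrastive loss form and the margin value are assumptions (the source does not state which loss is used), and the input vectors are hypothetical stand-ins for Doc2Vec embeddings.

```python
import math


def encode(vector, weights):
    """Shared encoder: the same weight matrix is applied to resumes and JDs."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def euclidean(a, b):
    """Euclidean distance between two encoded documents."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(distance, is_match, margin=1.0):
    """Penalize distance for matching pairs and closeness (within the margin) otherwise."""
    if is_match:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# Hypothetical shared weights and toy document embeddings.
SHARED_WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]
RESUME_VEC = encode([0.5, 0.0], SHARED_WEIGHTS)
JD_VEC = encode([0.0, 0.0], SHARED_WEIGHTS)
PAIR_DISTANCE = euclidean(RESUME_VEC, JD_VEC)
LOSS_IF_MATCH = contrastive_loss(PAIR_DISTANCE, is_match=True)
LOSS_IF_MISMATCH = contrastive_loss(PAIR_DISTANCE, is_match=False)
```

Because both inputs pass through the same weights, only one set of parameters has to be stored and updated, which is the parameter-sharing benefit noted above.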
3.2 Algorithms & techniques to improve the backend process of parsing
It is now evident that segmenting and labelling sequence data is essential for parsing resumes. HMMs and stochastic grammars used for segmentation and labelling make strong independence assumptions owing to their generative nature. They define a joint probability over label sequences and observations, which makes the models intractable because representing long-range dependencies of the observations or multiple interacting features is impractical. Maximum Entropy Markov Models (MEMMs) are better than generative models, but they, along with other finite-state models, suffer from the label bias problem, which causes a bias towards states with few successor states. The CRF framework works better than HMMs and MEMMs since it has a single exponential model for the joint probability of the entire label sequence given the observation sequence (Lafferty et al, 2001). Word vector representations were initially obtained using models like Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), which were later replaced by the Feedforward Neural Net Language Model (NNLM) and the Recurrent NNLM (RNNLM). (Mikolov et al, 2013) proposed two models, Continuous Bag-of-Words (CBOW) and the Continuous Skip-gram model (Skip-gram), which have much higher semantic and syntactic accuracy than the NNLM and RNNLM models. CBOW predicts the current word based on its context, while Skip-gram predicts the neighbouring words based on the current word.
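The difference between the two training objectives is easiest to see in the training examples each one generates. The sketch below only builds those examples from a token list; it does not train a model, and the function names and the window size are illustrative choices.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (current word, neighbouring word) pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: each (context words, current word) pair is one training example."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


PHRASE = ["senior", "java", "developer"]
SKIPGRAM = skipgram_pairs(PHRASE, window=1)
# [("senior", "java"), ("java", "senior"), ("java", "developer"), ("developer", "java")]
CBOW = cbow_examples(PHRASE, window=1)
# [(["java"], "senior"), (["senior", "developer"], "java"), (["java"], "developer")]
```

Skip-gram produces one example per context word, whereas CBOW merges the whole context into a single example per position.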
ML algorithms require some form of text representation as input since raw text cannot be fed to them directly. One such representation is the fixed-length Bag-of-Words (BoW) model. However, BoW loses a lot of information such as word order and sentence grammar. (Mikolov et al, 2014) proposed an unsupervised algorithm, Paragraph Vector, that is trained using stochastic gradient descent and backpropagation. It learns fixed-length feature representations from text of variable length. Paragraph Vector outperforms the bag-of-words model on a text classification task, with a relative improvement of about 30%.
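The loss of word order in a bag-of-words representation can be shown with a short example. This is a minimal sketch with a hypothetical three-word vocabulary and whitespace tokenization; two sentences with opposite meanings map to the same vector.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Fixed-length count vector: one slot per vocabulary word, order is discarded."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


VOCABULARY = ["developers", "managed", "testers"]  # hypothetical vocabulary
VEC_A = bag_of_words("Developers managed testers", VOCABULARY)
VEC_B = bag_of_words("Testers managed developers", VOCABULARY)
```

Both vectors are `[1, 1, 1]`, so a classifier fed only these counts cannot tell who managed whom. This is the limitation that order-aware representations such as Paragraph Vector aim to address.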
A huge amount of feature engineering and lexical information is required for effective NER. Taking inspiration from (Collobert et al, 2011), a novel neural network-based architecture is proposed to minimize this need (Chiu et al, 2016). The model is a hybrid of Bi-LSTM and CNN that uses character-level and word-level features to attain state-of-the-art performance for NER. Similarly, a DNN architecture named CharWNN employs character-level and word-level representations to perform POS tagging (Dos Santos et al, 2014). State-of-the-art results are achieved using a convolutional layer to extract character-level features without using handcrafted features.
Table 3. Advanced algorithms and frameworks used in Resume Parsing
3.3 Auxiliary techniques to aid parsing
Table 4. Auxiliary techniques to aid parsing
This segment presents some techniques that, with a little extra effort, could improve the parsing of resumes. Normalizing the names of academic organizations would improve the parsing process by enabling efficient clustering of candidates from the same academic background and a better understanding of market dynamics in terms of the pay scale of candidates from the same college/university. sCooL is a framework designed for CareerBuilder (CB) to achieve this (Jacob et al, 2014). It works in two major steps. In the first step, the database is initialized for normalization by creating a mapping from names using the MediaWiki API and the existing CB database. The mappings are then merged, duplicate and invalid mappings are removed based on a similarity measure, and the valid mappings are indexed using Lucene (McCandless et al, 2010). The second step is to perform normalization on the institute names. This step involves the removal of unwanted or invalid names using the J48 classifier developed using Weka (Hall et al, 2009). Normalized institutes can be retrieved efficiently using sCooL’s search query, as it allows the user to select from a range of string-comparison algorithms such as N-gram, Jaro-Winkler and Levenshtein, or a combination of algorithms. A combination of case-insensitive equality and Lucene Levenshtein works best with their system. A concept similar to job recommendation, which is used to suggest suitable jobs to job-seeking candidates, can also be used by recruiters to classify the resumes in their database into different domains or job categories so that candidates irrelevant to their field can be removed. This helps in efficient utilization of database capacity. (Sayfullina et al, 2017) trains a fastText classifier and a custom CNN model on job descriptions and tests them on a set of resume summaries. 
fastText is used due to its strong performance without requiring a GPU and its reportedly better performance than DNNs like CNN and char-CNN. The custom CNN receives its input as a matrix formed by concatenating word vector representations (obtained using the word2vec model) by rows. In order to capture the most important feature, max-pooling is applied after the convolution. Finally, a softmax layer is used to obtain a probability distribution over classes. The custom CNN model outperforms the fastText classifier in predicting the job domains. An alternative method parses resumes using Stanford CoreNLP, regex and pattern-matching operations (Mittal et al, 2020). TF-IDF is used to extract features from the parsed information, and logistic regression is applied to assign a job domain to the resumes. The classifier is trained using a manually curated training set of skills along with their job domains. There is hardly any research in the field of resume visualization. Extracted information from resumes, if visualized, can make the job of recruiters much easier. An attempt at visualizing the resumes of government officials is made by ResumeVis (Zhang et al, 2018). This visualization tool displays three major graphs: a statistical histogram for comparing career trajectories using a naïve Bayes classifier; an ego-network based spiral graph for displaying interpersonal relationships among candidates, built using the Apriori algorithm for mining frequent resume sets from a basket dataset of resumes and organizations, a custom matching algorithm to measure similarity among the resumes in the set, and cosine similarity to compute implicit relationships; and an organizational individual mobility map among various sectors that facilitates hiring decisions.
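Returning to the institution-name normalization discussed at the start of this segment, the combination of a case-insensitive equality check with a Levenshtein fallback can be sketched in plain Python. This is not sCooL's implementation, which relies on Lucene: the function names, the maximum edit distance of 2 and the canonical name list are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def match_institution(name, canonical_names, max_distance=2):
    """Return the canonical name for a raw institution string, or None if no close match."""
    key = name.strip().lower()
    # Step 1: case-insensitive exact match.
    for candidate in canonical_names:
        if candidate.lower() == key:
            return candidate
    # Step 2: closest candidate by edit distance, accepted only within the threshold.
    best = min(canonical_names, key=lambda c: levenshtein(key, c.lower()))
    if levenshtein(key, best.lower()) <= max_distance:
        return best
    return None


# Hypothetical list of normalized institution names.
CANONICAL = ["Stanford University", "Harvard University"]
EXACT = match_institution("stanford university", CANONICAL)
FUZZY = match_institution("Stanfrod University", CANONICAL)
NO_MATCH = match_institution("MIT", CANONICAL)
```

Trying the cheap equality check first avoids computing edit distances for the common case where only capitalization differs. The distance threshold trades recall for precision and would need tuning on real data.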