Annotation and Classification of Graphs of Property Values Reported in Materials Science Literature

Figure 1: Example of the graph classification process. Figure 2: Overview of dataset construction. Figure 3: Example graph images 1: (a) [19], (b) [20], (c) [21], (d) [20], (e) [22], (f) [23], (g) [24], (h) [25], (i) [26], (j) [21], (k) [27], (l) [28]. Figure 4: Example graph images 2: (a) [20], (b) [29], (c) [30]. Figure 5: Samples of a "one-graph figure" and a "multiple-graph figure": (a) [31], (b) [32].


I. INTRODUCTION
Given the emergence of data science and machine learning in materials science, increased importance is placed on obtaining data. Data in materials science are particularly heterogeneous due to the significantly wide range of material classes and the variety of material properties. In recent studies, several tools based on natural language processing have been applied to extract such data from the literature [1]-[3]. However, these data may appear in the literature in several forms, such as tables and figures, which are often ignored by natural language processing even though they contain a significant quantity of data on experimental measurements of material properties. Accordingly, we focus particularly on graphs, i.e., figures that graphically represent numerical data, such as bar graphs and line graphs, because they are often employed to show the evidence supporting the main claims in the materials science literature and thus often include its most important information.
Several methods have recently been reported to automatically consume and codify information in scientific literature figures across domains, such as image classification and optical character recognition [4], [5], based on techniques adapted from the computer vision field. These methods have immense potential for obtaining the data necessary for data-driven materials research from the literature. Furthermore, there have been several studies on identifying characteristic values in figures and tables in the literature using machine-learning-based approaches; for example, parsing result figures [6], extracting values from tables [7], classifying figures of biomedical articles into five predefined figure types [8], and quantifying data from microscopy images [9]. However, in the field of materials science, no general method for identifying the characteristic values of graphs has been reported.
In this study, as the first step toward automatic information extraction from graphs, we focus on automatically extracting and classifying graphs in the materials science literature. To achieve this, we construct a dataset that classifies graphs according to types of conditions. We define new conditions for graph classification based on an examination of actual graphs reported in papers. We annotated the types based on the images and captions. To classify the annotated images, we propose deep-learning-based classification models that utilize the multimodal information of graphs, consisting of graph images, text in graphs, and captions. We first prepare baseline models for each type of unimodal information, which are popularly employed as baselines in natural language processing and image processing. We then consider two models to combine the unimodal information: one integrates the feature representations of the unimodal models, while the other aggregates their prediction results. The best multimodal model classified graphs in the proposed dataset with a micro-F1 score of 96.1%, outperforming the unimodal models.
The primary contributions of this study can be summarized as follows: (i) we construct a manually annotated dataset that classifies graphs reported in the materials science literature according to property conditions, and (ii) we propose deep-learning-based classification models that utilize the multimodal information of graphs (graph images, text in graphs, and captions) and show that the best multimodal model outperforms the unimodal models.

II. RELATED WORK

A. EXTRACTION OF MATERIAL INFORMATION FROM TEXT
Automatic extraction of information related to the properties of materials from the literature is an important task in the field of materials science. Therefore, many studies related to automatic extraction have been conducted. Many of those studies utilize techniques based on natural language processing.
As a first step in extracting material information from the literature, Mysore et al. [1] constructed a dataset of 230 material synthesis processes annotated in scientific papers. Then, Kononova et al. [2] generated a dataset of "codified recipes" for solid-state synthesis automatically extracted from scientific publications. Kuniyoshi et al. [3] proposed a system to extract the synthesis process of all-solid-state batteries from the scientific literature using a deep-learning-based sequence tagger and a simple heuristic rule-based relation extractor.
The above studies used text in the literature to extract material property information by natural language processing. They could not extract material property information from figures, which contain a significant quantity of data on experimental measurements of material properties.

B. FIGURE RECOGNITION ON SCIENTIFIC LITERATURE
Existing search engines for academic papers focus on extracting information from text, so they are not good at extracting information from images by figure recognition. However, figure recognition is important because it is necessary for extracting results from figure images when analyzing the results reported in a paper. Thus, there have been several studies on machine learning models that capture the rough characteristics of figures, such as figure type (pie chart, bar chart, etc.). Clark et al. [10] proposed PDFFigures 2.0, a method for extracting tables in addition to images and graph captions from the literature in PDF format. Siegel et al. [6] developed an end-to-end system for detecting the locations of figures in a research paper and parsing the results in those figures. Kahou et al. [5] developed a visual reasoning corpus of question-answer pairs grounded in images to enable a more detailed analysis of figures by machine learning systems.

III. METHODS
In this section, we describe the proposed approach for extracting and classifying graphs in the materials science literature in detail. First, we build a dataset that classifies graphs according to types of conditions from the actual graphs reported in papers. An overview of the property value graph dataset construction from published papers is shown in Figure 1. Then, we classify the graphs based on types of conditions by using the information in the graphs in a unimodal or multimodal way. An overview of our proposed classification method is shown in Figure 2.

A. CONSTRUCTING GRAPH DATASET
Our dataset is a set of triples of graph images, captions, and labels obtained based on property conditions. The dataset is constructed by extracting graph images from a large collection of published journal papers in the field of materials science. Then, we label the images leveraging crowdsourcing, enabling the creation of large datasets in short periods. In the following sections, we explain the construction in detail.

1) Label definition
To classify the images according to their corresponding conditions, we define the types of graphs using the following conditions. Examples of each type of graph are shown in Figures 3 and 4.

[Figure 2 illustrates text extracted from graph images, e.g., "A as a function of temperature for B." and "X as a function of temperature for B."]

Time: The time condition is a parameter for the time course of each process and measurement in material synthesis (see Figure 3(b)).
Wavelength: The wavelength condition is a parameter often used to analyze the components of materials by various methods, such as absorption spectroscopy (see Figure 3(d)).
Capacity: The capacity condition is a parameter related to battery performance. To represent the performance, charge and discharge curves are often drawn (Figure 3(f)).
Pressure: The pressure condition is a parameter related to the absorption and desorption operations on the surfaces of materials.For example, it is shown on the horizontal axes of absorption-desorption isotherms (see Figure 3(g)).
Ohm: The ohm condition is a parameter related to the performance of materials as counter electrodes. For this analysis, Nyquist plots are often drawn (Figure 3(g)).
Voltage: The voltage condition is a parameter that reveals the electrochemical reaction mechanism of materials. For this purpose, discharge and charge curves are drawn (Figure 3(h)).
Cycle: The cycle condition is a parameter used for evaluating material durability against various indices (Figure 3(k)).
Other: Graphs of measurement conditions other than the 11 listed above (Figure 3(l)).
Additionally, we define the label "not target" for classifying the figures reported in papers, which indicates figures that are not graphs. Such figures include photographs, images of compounds, and diagrams containing multiple figures, as shown in Figure 4.

2) Collecting paper data
The journal "Journal of Materials Chemistry A" (2015-2019) from the Royal Society of Chemistry (RSC), which publishes papers dealing with the synthesis of materials for batteries, was selected as the target. The papers provided by RSC are in XML format and contain figures, tables, and schemes. The statistics of the target article data are listed in Table 1.
As shown in Figure 5, a figure extracted from a paper may contain a single graph or multiple graphs, referred to as a "one-graph figure" or a "multiple-graph figure", respectively.

3) Splitting one-graph and multiple-graph figures
We construct a dataset containing only one-graph figures to avoid the difficulties of splitting multiple-graph figures. The selection proceeded as follows.
First, one-graph and multiple-graph figures were extracted using different methods. One-graph figures were extracted using the compound figure separator (CFS) proposed by Tsutsui et al. [25]: the figures for which CFS detected zero or one subfigure were chosen as one-graph figures. We extracted 8,634 one-graph figures from the 91,019 figures obtained from 14,071 papers, as shown in Table 1. Multiple-graph figures were extracted by applying a regular expression to the figure captions. This rule judges whether a caption contains "(a-z)" or "(A-Z)" patterns and extracts the corresponding figures as multiple-graph figures. For example, if we apply the rule to the caption "(a) X-ray diffraction patterns of samples; SEM images of (b) carbon spheres, (c and d) MnO2, and (e and f) MnO2/C composite" (Figure 1 of [26]), the figure is extracted as a multiple-graph figure because its caption contains four instances of such patterns. As a result, 64,361 multiple-graph figures were extracted.
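As an illustration, the caption rule can be sketched in Python as follows; the pattern and function name are our own illustration, not the authors' implementation.

```python
import re

# Treat a figure as a multiple-graph figure when its caption contains
# parenthesized panel letters such as "(a)", "(B)", or "(c and d)".
PANEL_PATTERN = re.compile(r"\(\s*[a-zA-Z](?:\s+and\s+[a-zA-Z])?\s*\)")

def is_multiple_graph(caption: str) -> bool:
    """Return True if the caption references lettered sub-panels."""
    return bool(PANEL_PATTERN.findall(caption))

caption = ("(a) X-ray diffraction patterns of samples; SEM images of "
           "(b) carbon spheres, (c and d) MnO2, and (e and f) MnO2/C composite")
print(len(PANEL_PATTERN.findall(caption)))  # 4 -> multiple-graph figure
print(is_multiple_graph(caption))           # True
```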
Next, we designed a binary classification model for one-graph and multiple-graph figures using Lobe, a machine learning tool for easily training custom image classifiers. We randomly sampled 1,000 one-graph images and 1,000 multiple-graph images to train the classifier. We then applied the trained classifier to all the figures from the 14,071 papers and extracted 16,668 one-graph figures.

4) Annotation details
Annotation was performed using crowdsourcing to reduce the time required. We employed annotators via Amazon Mechanical Turk to label the extracted graph images. To ensure the quality of the dataset, up to nine annotators annotated each figure. The annotation process took approximately three days, and 732 annotators participated in this task.

5) Label integration
Our dataset contains up to nine labels per figure, but the labels sometimes vary because of the complexity of the figure, misunderstandings, etc. Therefore, we implemented majority voting, which is a typical quality management method for crowdsourcing: the label with the highest number of occurrences among the annotated labels is selected. In addition, we set a threshold on the agreement rate of the majority vote and extracted only the annotation data with agreement rates above the threshold. To determine the threshold, 100 figures were annotated by two annotators; the agreement between the two annotators was 83%. Considering that the number of annotators was more than two, we set the threshold to 80%. Table 2 shows the size of the dataset after label integration.
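This integration step can be sketched as follows; the function name and the convention of discarding low-agreement figures are our own illustration.

```python
from collections import Counter

def integrate_labels(labels, threshold=0.8):
    """Majority vote over crowdsourced labels; keep the figure only if the
    winning label reaches the agreement-rate threshold."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) >= threshold else None  # None: discard

print(integrate_labels(["voltage"] * 8 + ["capacity"]))      # 'voltage' (8/9)
print(integrate_labels(["voltage"] * 5 + ["capacity"] * 4))  # None (5/9)
```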

6) Evaluation of the dataset
First, we randomly sampled 100 figures from the dataset and had them re-annotated by an annotator. We evaluated these annotations by comparing them with the dataset labels. The results show that the accuracy of the labels in the dataset was 99%, indicating that the labeling is accurate.

B. CLASSIFICATION WITH UNIMODAL INFORMATION
We utilized three types of information for classification: graph images, text in graphs extracted using Textract (Amazon's OCR service), and captions.
In this section, we discuss classification using each of the three information types. We prepared different classification models for each information type (unimodal information). In the following sections, we describe graph image classification, graph text classification, and caption classification in order.

1) Graph image classification
For image-based classification, we used two models, viz. ResNet [27] and EfficientNet [28], which have been reported to perform well in image-based tasks and can easily employ transfer learning. We trained these models by fine-tuning them on the constructed dataset.
ResNet is a model that enables the training of deeply layered networks by introducing residual connections. In our experiment, the images were resized to 224 × 224 pixels, and we employed ResNet50.
EfficientNet is a model that achieves high performance with fewer parameters than conventional models through model scaling.EfficientNet-B0 was used in our experiments for the same reason as ResNet.
Here, we describe the prediction flow using these two models.
First, we create a feature vector x̂ from the input image tensor x using one of these two models. Since each input image is an RGB image resized to 224 × 224 pixels, the dimensions of the input image tensor x are (224, 224, 3).
Please note that the dimensions of the feature vector x̂ differ between the two models. The feature vector x̂ is then passed to a single-layer fully connected (FC) network, and finally, the probability p of each label is calculated by applying a softmax function for prediction.
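The display equations were lost during text extraction; a plausible reconstruction of Eqs. (1)-(3), consistent with the references to them below, is

\hat{x} = f_{\mathrm{CNN}}(x; \theta), \quad (1)

h = W \hat{x} + b, \quad (2)

p = \mathrm{softmax}(h), \quad (3)

where f_CNN denotes the ResNet or EfficientNet backbone.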
Here, θ denotes the model parameters, and y represents a set of candidate labels for prediction. We choose the label with the highest probability p for the input image representation x̂ as the prediction result.

2) Graph text classification
In this section, we explain the prediction flow using text extracted by Textract from the graph images.
Since Textract extracts text word by word without any contextual information, the feature vector of the text t is represented by a bag-of-words (BoW) or a term frequency-inverse document frequency (TF-IDF) vector.
The probability p of each label is predicted by passing the feature vector t through a two-layer fully connected network and applying a softmax function to the output.
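The corresponding display equation was lost during text extraction; assuming the text feature t is the BoW or TF-IDF vector (presumably Eq. (4), given the later reference in Section III-D), a plausible reconstruction of the two-layer network is

p = \mathrm{softmax}\left(W_2 \, \mathrm{ReLU}(W_1 t + b_1) + b_2\right).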
Here, ReLU is the nonlinear activation function.
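For illustration, such BoW and TF-IDF features can be built with scikit-learn; the sample graph texts below are made up, and this is not the authors' implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Each "document" is the concatenation of the words Textract reads off a graph.
graph_texts = [
    "voltage capacity mAh charge discharge",
    "intensity theta degree",
]
bow = CountVectorizer().fit_transform(graph_texts)    # t as raw word counts
tfidf = TfidfVectorizer().fit_transform(graph_texts)  # t as TF-IDF weights
print(bow.shape, tfidf.shape)
```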

3) Caption classification
Two models are used for classification with caption texts: a convolutional neural network (CNN) as a baseline and a CNN with Mat-word2vec [29] (Mat-CNN), a domain-specific word embedding for materials science.
The CNN creates feature representations of sentences from randomly initialized word representations and predicts the labels based on the representations. In the following paragraphs, we explain the prediction process.
First, we obtain the representation of an input sentence s using random word representations.
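The display equation was lost during text extraction; it presumably defined the sentence representation as the sequence of word vectors,

s = (w_1, w_2, \ldots, w_N),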
where s is the sequence of 100-dimensional vector representations w of words, and N is the maximum sentence length.
Next, we generate a feature vector of the sentence ŝ using the CNN from the word representations s.
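The display equation was lost during text extraction; given the later references to Eq. (8), it presumably read

\hat{s} = \mathrm{CNN}(s). \quad (8)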
Finally, as shown in Eqs. (2) and (3), the feature vector ŝ is passed through a single-layer FC neural network, and the probability p of each label is calculated by applying a softmax function to the output. We choose the label with the highest probability p for the input sentence representation ŝ as the prediction result.
For Mat-CNN, we initialize the word representations w with the materials science word embeddings acquired by Kim et al. from the literature using word2vec [29]. We predict the label in the same manner as for the CNN described above.
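A minimal PyTorch sketch of such a caption classifier follows; the vocabulary size, filter settings, label count, and class name are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class CaptionCNN(nn.Module):
    """100-d word embeddings, a 1-D convolution, max-over-time pooling,
    then an FC layer with softmax, as described above."""
    def __init__(self, vocab_size=20000, emb_dim=100, n_filters=128,
                 kernel_size=3, n_labels=13, pretrained=None):
        super().__init__()
        # Mat-CNN initializes the embeddings with Mat-word2vec vectors;
        # the plain CNN baseline initializes them randomly.
        self.emb = (nn.Embedding.from_pretrained(pretrained, freeze=False)
                    if pretrained is not None
                    else nn.Embedding(vocab_size, emb_dim))
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size)
        self.fc = nn.Linear(n_filters, n_labels)

    def forward(self, token_ids):                           # (batch, N)
        s = self.emb(token_ids).transpose(1, 2)             # (batch, emb_dim, N)
        s_hat = torch.relu(self.conv(s)).max(dim=2).values  # ŝ: (batch, n_filters)
        return torch.softmax(self.fc(s_hat), dim=-1)        # p: label probabilities

p = CaptionCNN()(torch.randint(0, 20000, (2, 50)))
print(p.shape)  # torch.Size([2, 13])
```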

C. CLASSIFICATION WITH TWO TYPES OF MULTIMODAL INFORMATION
We consider the prediction based on any two out of the following three types of information: graph image, graph text, and captions.
The feature representation vectors of the two types of information are concatenated to produce the final feature vector h_1. For instance, to create a feature vector from an image and its caption, we concatenate their representations from Eqs. (1) and (8): h_1 = [x̂; ŝ]. Finally, we pass the feature vector h_1 through a single-layer FC neural network according to Eqs. (2) and (3) and apply a softmax function to the output to predict the probability p of each label.

D. CLASSIFICATION WITH MULTIMODAL INFORMATION
We adopt two approaches to combining the three types of information: one combines the features as in the two-modal case, and the other ensembles the predictions of the unimodal models.
The method for combining the features is the same as that used for two types of information. Using Eqs. (1), (4), and (8), the feature vector for prediction is represented as h_1 = [x̂; t; ŝ]. We refer to this method as the Concat method, sketched below.
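A minimal PyTorch sketch of the Concat method follows; the unimodal feature dimensions, the label count, and the class name are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class ConcatClassifier(nn.Module):
    """Concatenate the image feature x̂, the graph-text feature t, and the
    caption feature ŝ, then apply a single FC layer with softmax."""
    def __init__(self, img_dim=1280, txt_dim=5000, cap_dim=128, n_labels=13):
        super().__init__()
        self.fc = nn.Linear(img_dim + txt_dim + cap_dim, n_labels)

    def forward(self, x_hat, t, s_hat):
        h1 = torch.cat([x_hat, t, s_hat], dim=-1)  # h_1 = [x̂; t; ŝ]
        return torch.softmax(self.fc(h1), dim=-1)  # p: label probabilities

model = ConcatClassifier()
p = model(torch.randn(2, 1280), torch.randn(2, 5000), torch.randn(2, 128))
print(p.shape)  # torch.Size([2, 13])
```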
In the second method, predictions are made by multiple models trained separately on each type of information, and the final prediction label is determined by their majority vote.We refer to this method as the ensemble method.

E. IMPLEMENTATION DETAILS
We used a g4dn.4xlarge instance of Amazon Elastic Compute Cloud (Amazon EC2) as the computing environment. The g4dn.4xlarge instance contains second-generation Intel Xeon Scalable (Cascade Lake) processors and an NVIDIA T4 Tensor Core GPU. We used the machine learning library PyTorch to implement the models described above.

IV. RESULTS AND DISCUSSION
In this section, we evaluate the performance of the trained graph classifier using the constructed dataset.

A. EXPERIMENTAL SETUP
In this section, we explain the evaluation method and the setting of various hyperparameters for evaluation.

1) Evaluation method
We employed the holdout method to evaluate the classification performance, and the F-score was used as the evaluation index. The dataset was divided into training, development, and test datasets at a ratio of 6:2:2 such that the label distributions of the three datasets were similar.
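Such a label-stratified 6:2:2 split can be realized, for example, with scikit-learn; the data below are dummies, and this is not the authors' code.

```python
from sklearn.model_selection import train_test_split

figures = [f"fig_{i}.png" for i in range(100)]  # placeholder figure IDs
labels = [i % 5 for i in range(100)]            # placeholder labels

train_x, rest_x, train_y, rest_y = train_test_split(
    figures, labels, test_size=0.4, stratify=labels, random_state=0)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, stratify=rest_y, random_state=0)
print(len(train_x), len(dev_x), len(test_x))  # 60 20 20
```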

2) Setting various hyperparameters
The hyperparameters used during training were chosen from the search ranges presented in Table 3 so as to produce the highest microaveraged F-score on the development dataset. Tuning was performed by pruning unpromising trials with the successive halving algorithm [30], implemented in the hyperparameter optimization framework Optuna [31]. During tuning, early stopping was performed at 20 epochs.
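A minimal sketch of such a tuning loop with Optuna follows; the hyperparameter names, ranges, and trial count are illustrative, not the actual search space of Table 3.

```python
import optuna

def train_and_eval_one_epoch(lr, batch_size):
    return 0.5  # placeholder: run one training epoch, return dev micro-F1

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    best_f1 = 0.0
    for epoch in range(20):  # early stopping at 20 epochs, as described above
        f1 = train_and_eval_one_epoch(lr, batch_size)
        best_f1 = max(best_f1, f1)
        trial.report(f1, epoch)
        if trial.should_prune():  # branch pruned by successive halving
            raise optuna.TrialPruned()
    return best_f1

study = optuna.create_study(
    direction="maximize", pruner=optuna.pruners.SuccessiveHalvingPruner())
study.optimize(objective, n_trials=50)
```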

B. GRAPH IMAGE CLASSIFICATION
We trained and evaluated EfficientNet and ResNet for graph image classification; the results are listed in Table 4. EfficientNet showed the higher performance of the two models for all labels, although the F-scores for "not target" and "voltage" were low. This suggests that it is difficult to extract the features of "not target" and "voltage" from the images: the "not target" images contain other diagrams with diverse variations (e.g., microscopic images and flowcharts), which may increase the classification difficulty, and the "voltage" images are considered more difficult to classify than the others because they are similar to some "energy" images. The microaveraged and macroaveraged F-scores did not differ significantly among the models. However, the F-score for each label varied, indicating that each model was accurate for different labels.
In addition, to check the tendency of misclassification in the graph images, we examine the confusion matrix of EfficientNet, which showed the highest classification performance (Figure 6). The vertical axis shows the correct labels, and the horizontal axis shows the labels predicted by the model. Figure 6 shows many errors for the "other" class. In particular, there are many cases where "other" is mistakenly predicted as "capacity" or "cycle" and where "time" is mistakenly predicted as "other". "Other" contains various graphs that do not belong to the labels defined according to the material properties. Therefore, we consider that graphs similar in general shape to those of "capacity", "cycle", and "time" are also included in "other", which may have caused these errors. Such error cases suggest that it is very difficult to classify information related to the context of a graph, such as the measured property, from the general shape of the graph image alone.
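For reference, a confusion matrix of this kind can be computed with scikit-learn; the labels and predictions below are made up.

```python
from sklearn.metrics import confusion_matrix

gold = ["other", "capacity", "other", "time", "cycle"]
pred = ["capacity", "capacity", "cycle", "other", "cycle"]
labels = ["capacity", "cycle", "other", "time"]
cm = confusion_matrix(gold, pred, labels=labels)  # rows: gold, cols: predicted
print(cm)
```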

C. GRAPH TEXT CLASSIFICATION
The text extracted by Textract was classified by creating feature representations using BoW and TF-IDF; the results are listed in Table 5. For all labels, BoW produced higher F-scores than TF-IDF. As for image classification, we show the confusion matrix of BoW, which showed the higher classification performance, in Figure 7. There were many errors for the "other" class. In particular, there were many cases where "other" was misclassified as "angle" or "not target". Many of the "angle" images have intensity on the vertical axis, and the "other" images occasionally have a similar intensity axis. Therefore, we consider that the BoW features of "other" sometimes resemble those of "angle", causing misclassification between them. We believe that "other" is mistaken for "not target" because it is difficult to create discriminative features for both. In addition, there are many cases where "voltage" was misclassified as "not target". The "voltage" images occasionally contain only a little text. We consider that the model misclassified "voltage" as "not target" because many "not target" images, such as microscopic images and flowcharts, contain little or no text.

D. CAPTION CLASSIFICATION
We trained and evaluated CNN and Mat-CNN using only the figure captions; the results are shown in Table 6, with Mat-CNN showing the better performance. For all labels except "ohm" and "Raman shift", Mat-CNN showed a higher F-score than CNN, indicating that word representations pretrained on the literature are useful for figure classification.
We show the confusion matrix of Mat-CNN in Figure 8. Several errors were observed for "time". In particular, there were many cases in which "time" was misclassified as "cycle", "not target", "other", or "voltage", and, conversely, many cases in which "cycle", "not target", "other", or "voltage" were misclassified as "time". The distribution of labels that "time" is easily mispredicted as and the distribution of labels easily mispredicted as "time" are similar, suggesting that the contextual feature representation of the "time" captions is similar to that of the misclassified labels.

E. CLASSIFICATION WITH TWO TYPES OF MULTIMODAL INFORMATION
When considering two types of information, we used the models that showed the best performance in the unimodal experiments. Specifically, the EfficientNet, BoW, and Mat-CNN models were used for the figure images, the text extracted by Textract from the images, and the captions, respectively. The results are shown in Table 7. Image+Text and Text+Caption showed similar overall performance, but Text+Caption showed slightly higher microaveraged and macroaveraged F-scores. Furthermore, the F-score of Caption+Image was much lower than those of Image+Text and Text+Caption, indicating that the text extracted from the images is useful for graph classification. The F-score for each label varied, indicating that the best-performing combination depended on the label.

F. CLASSIFICATION MODEL WITH ALL MULTIMODAL INFORMATION
Prediction was performed using all three types of information: the figure images, the text extracted by Textract from the images, and the captions. As in the classification with two types of multimodal information, we used the models that showed the best performance in the unimodal experiments.
The results are shown in Table 8. The F-scores of the Concat method were higher than those of the classification with one or two types of information, indicating that all three types of information are useful for figure classification. The Concat method showed higher microaveraged and macroaveraged F-scores than the ensemble method, indicating that learning a representation that integrates the three types of information is more accurate than using the three models' predictions individually.

V. CONCLUSIONS
In this paper, we proposed an approach to classify graphs according to their property conditions. We constructed a manually annotated dataset for classification by property conditions and evaluated it. We also proposed and evaluated deep-learning-based classification models in both unimodal and multimodal settings. To the best of our knowledge, this is the first study to classify graphs according to their property conditions using multimodal information with deep learning models.
The results showed that the models labeled the graphs and classified property conditions with a microaveraged F-score as high as 0.961. Furthermore, we showed that the simultaneous use of graph images, text in graphs, and captions can improve classification performance.
In the future, we will consider several methods for improving the performance of these classification models. First, we will investigate how to use the multimodal information more effectively. Second, we will consider incorporating the in-text citations of figures in the manuscript. Third, we will construct a model that can handle multiple-graph figures. We will also consider the automatic extraction of information from graphs and other nontextual components.

FIGURE 1.
Overview of the property value graph dataset construction from published papers.

FIGURE 2.
Example of the graph classification process.

TABLE 1.
Statistics of the target article data

TABLE 2.
Size of the dataset after label integration

TABLE 3.
Search range for hyperparameters

TABLE 4.
Classification using graph images

FIGURE 6.
Confusion matrix of gold labels (vertical axis) vs. EfficientNet predictions (horizontal axis)

TABLE 5.
Graph text classification

FIGURE 7.
Confusion matrix of gold labels (vertical axis) vs. BoW predictions (horizontal axis)

TABLE 6.
Classification using captions

FIGURE 8.
Confusion matrix of gold labels (vertical axis) vs. Mat-CNN predictions (horizontal axis)

TABLE 7.
Classification with two types of multimodal information

TABLE 8.
Classification with all multimodal information