Greek Literary Papyri Dating Benchmark

Dating papyri accurately is crucial not only for editing their texts but also for our understanding of palaeography and the history of writing, ancient scholarship, material culture, networks in antiquity, etc. Most ancient manuscripts offer little evidence regarding the time of their production, forcing papyrologists to date them on palaeographical grounds, a method often criticized for its subjectivity. In this work, with data obtained from the Collaborative Database of Dateable Greek Bookhands, hosted by Baylor University, an online collection of objectively dated Greek papyri, we created a dataset of literary papyri, which can be used for computational papyri dating. We provide an experimental benchmark on this dataset by fine-tuning four convolutional neural networks (CNNs) pre-trained on generic images.


Introduction
The object of papyrology is reading, studying, interpreting, and exploiting ancient texts preserved on papyrus [1]. In reality, however, we cannot define this discipline based on the writing material [2], considering that a papyrologist also studies texts surviving on parchment, ostraca, wood, bone, stone, and fabric (but not inscriptions; the writing medium must therefore be portable). These texts are exactly the same as the ones surviving on papyrus, and they come from the same societies and date to the same periods of time [1]. Therefore, it would be more appropriate to adopt Bagnall's definition [2] that "papyrology is a discipline concerned with the recovery and exploitation of ancient artifacts bearing writing and of the textual material preserved on such artifacts". In terms of content, we can define two main categories of papyri: literary papyri, bearing texts of literary interest, and documentary ones, bearing texts on various topics of daily life, such as contracts, tax receipts, business letters, etc. [1]. Dating papyri is considered particularly important for the interpretation and the assessment of their content [1]. Documents are often much easier to date, since they frequently bear a date or some reference to known people, institutions, offices or other evidence helpful in that direction. Nonetheless, chronological attribution is not always straightforward: the writers of private letters for the most part did not record dates, while literary texts remain dateless [3]. So what methods do papyrologists apply in these cases?

Background
Turner [3] in his work "Greek Manuscripts of the Ancient World" describes some of the methods employed for papyrus dating. In some cases, archaeological evidence may be of assistance, as with the papyri from Herculaneum, which we know were written before 79 AD, when Mount Vesuvius erupted. Furthermore, when a document and a literary papyrus are found together in a mummy cartonnage, we can trust the date of the dated documents as a terminus ante quem for the literary text, since both papyri were discarded at the same moment as useless paper. More trustworthy are the dates we can extract when the back of a papyrus is reused. More specifically, when there is a dated document on the front side (the recto side), we know that the text on the back (the verso side) was written or copied after the date of the dated document. Conversely, if the dated document is on the back of the papyrus, we know that the text on the front was written or copied before the date of the document. However, in this case we cannot be sure of the time gap between the two. In the event that none of the above evidence is available for dating, we can take into account the content, such as events that are described, or "exploit fashions in 'diplomatic' usage, such as the use of and form taken by abbreviations" [3].
The method used predominantly to get more accurate results, especially when all the other criteria are absent, is based on palaeography, i.e. the study of the script. Dating on palaeographical grounds rests on the assumption that graphic resemblance implies that two manuscripts are contemporary [4]. Literary papyri are written in elaborate, conservative and more formal writing styles that remain unchanged for decades or even centuries, whereas documentary papyri are almost always written in cursive scripts that can be dated with relative accuracy [1]. However, this distinction is not absolute, considering that, as stated by Choat [5], "many dated documents, and the scripts of some of these are sufficiently similar to those of literary papyri for them to form useful comparanda to the latter" and, as Mazza [4] adds, documentary papyri are frequently written in literary scripts and, vice versa, literary papyri are copied in documentary scripts. Relying solely on palaeography is therefore a great challenge that presents many considerable difficulties. For the chronological attribution of a papyrus, the papyrologist should have "a wide range of potential comparanda and have them available for easy consultation" [5]. This is not an easy task, nor can it be achieved without proper training. Besides, as stated above, the fact that literary texts almost never bear a chronological indication results in a very small number of securely dated literary papyri that can form a basis for comparison. In addition, to estimate the date of a papyrus one should take into account all the parameters, such as the provenance, the context, the content, the language, the dialect, the codicology, the page layout, the general appearance of the script, and the specific letter shapes of the papyrus under examination [5].
Lastly, but most importantly, we should not overlook the subjectivity of the whole method, a factor which, according to Choat [5], is given less regard than it should be.
In recent years, there have been efforts to date manuscripts of various languages with the help of computational means [6]. In reality, what these tools and techniques are trying to achieve is the chronological attribution of the manuscripts, based on the palaeographic assumption of the affiliation of scripts, described above, trying, nonetheless, to eliminate the subjective element of this method. However, our study of the literature makes it clear that most of these computational approaches disregard manuscripts written in the Greek language.

Our contribution
Greek papyri form a distinct collection of ancient manuscripts. Although they share characteristics common to all ancient (and, for that matter, modern) handwritten artefacts, they also have a number of properties unique to them, which call for new research dictated by and centred on their specificities. Such properties are the time and geography of their production (which includes the materials involved, e.g. papyrus and ink), format, state of conservation and, most importantly, writing culture in the Graeco-Roman world and the evolution of Greek script and writing. Unlike ancient manuscripts in other languages, collections of Greek papyri are both plentiful and scarce, both uniform and diverse. They are scarce compared to medieval Greek manuscripts (particularly in size), but still numerous. They (almost) unfailingly come from Egypt, a small fraction of the Greek-speaking ancient world, but they exhibit sufficient diversity in their content, form, and script to merit separate and distinct examination.
This research is an initial investigation into the computational dating of Greek papyri. It exploits available data resources to explore machine learning methods that may assist papyrologists by estimating a date for papyri of unknown date. Using data obtained from an online collection of securely dated papyri, a machine-actionable dataset was created, suitable for the task of computational dating of Greek literary papyri. We used this dataset to train deep learning classifiers, benchmarking their ability to estimate the date of a papyrus. The best results were achieved by a DenseNet fine-tuned on randomly cropped image patches. The remainder of this article first summarises the related work and then describes the presented dataset. The methodology used is presented next, followed by the experiments that were undertaken on the dataset, and a discussion. A summary of the findings and suggestions for future work concludes this article.

Related Work
Dating of papyri (images) with computational means has been studied for many languages [8,9,10,11,12], but not for Greek. Dating the text image is very different from dating the text of the image, as has been done for ancient Greek inscriptions [13]. For instance, the latter requires transcription of the text in the image, which is a time-consuming process. Such a technique is also irrelevant in the case of literary papyri, because the texts that they transmit typically date much earlier than the actual manuscripts (e.g. a scribe in Late Antiquity copying on papyrus a Homeric poem composed more than a thousand years earlier). Also, any information regarding the script, or any clues aside from the text, would be disregarded. Given the absence of dating methods for Greek, the following overview focuses on the dating methods studied for other languages.
The methods employed were usually standard machine learning methods, such as k-nearest neighbours (KNN) [12], decision trees [8], random forests [8] and support vector machines [11,9,17,18,19]. Textural features, such as Gabor filters, Uniform Local Binary Patterns and Histogram of Local Binary Patterns, are extracted and then fed to the classifiers [16]. The evolution of the writing style, however, has also been used as an intermediate step [9,12]. In this case, the periods are first aligned with specific writing styles. Then, any new manuscript is dated based on the detected style.
Pre-trained convolutional neural networks have been used to extract features, which are passed to a classifier or regressor [11,14], or used in combination with text features extracted with optical character recognition methods [10]. Transfer learning has been reported to reach human-level performance [14]. This was deemed to be the most promising direction for the present study on Greek manuscripts and was, hence, employed.

Data
It should be clarified that the online collection used was compiled on the basis of the script and not the content of each papyrus (there are some documents and subliterary papyri in the CDDGB) and, therefore, the dataset developed maintains this criterion of categorization of manuscripts. In some cases only a limited sample of the papyrus images was used for training, as only a small number of papyri was available, allowing for a slight increase in training data in the future.

PaLIT
The Collaborative Database of Dateable Greek Bookhands (CDDGB) is an online catalogue of ancient Greek manuscripts written in literary script, from the 1st to the 9th century AD, hosted by Baylor University. The manuscripts it contains can be dated on the basis of some objective criterion, such as the presence of a dated document on the reverse side, or a datable archaeological context associated with the manuscript. The list of papyri included in this dataset could have been more comprehensive, as extensive bibliographic information is not included and secondary literature has not been consulted. Such tasks have already been undertaken by two ongoing (and highly anticipated) projects, as reported in the introduction of the CDDGB website. However, for lack of a better alternative and since the collection of objectively dated bookhands goes beyond the scope of this study, this collection is deemed adequately reliable. Moreover, it is unlikely that a complete list of securely dated literary papyri would increase the number of specimens beyond the lower hundreds. Figure 1 shows in detail the distribution by century of the image data taken from this collection. The total number of images used from PaLIT is 161. Collected images were in JPG or PNG format and their resolution varied. A few images in GIF format are also included in the collection, but we excluded them owing to their poor quality. Specimens in the collection written in minuscule script were excluded too, because minuscule Greek cannot be placed confidently into the script evolution process and appears after the period on which this study focuses. Finally, a challenge we had to deal with was duplicates, that is, the multiple images that are in many cases available for a single papyrus, each of which depicts a different part of it. We therefore chose one representative image for each papyrus, to the exclusion of the rest.
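The selection rules described above (keep JPG or PNG images, drop minuscule specimens, keep one image per papyrus) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the record layout, the identifiers and the file names are hypothetical, and "representative" is approximated here by simply taking the first eligible image of each papyrus.

```python
from pathlib import PurePath

# Hypothetical records: (papyrus_id, image_file, script). Not taken from the CDDGB.
RECORDS = [
    ("P1", "p1_col1.jpg", "majuscule"),
    ("P1", "p1_col2.jpg", "majuscule"),   # second image of the same papyrus
    ("P2", "p2.gif", "majuscule"),        # GIF images are excluded
    ("P3", "p3.png", "minuscule"),        # minuscule specimens are excluded
    ("P4", "p4.PNG", "majuscule"),
]

ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def select_images(records):
    """Return one image per papyrus after applying the exclusion rules."""
    chosen = {}
    for papyrus_id, image_file, script in records:
        if PurePath(image_file).suffix.lower() not in ALLOWED_SUFFIXES:
            continue  # skip GIF and any other unsupported format
        if script == "minuscule":
            continue  # skip minuscule script
        # Assumption: the first eligible image stands in for the papyrus.
        chosen.setdefault(papyrus_id, image_file)
    return chosen


selected = select_images(RECORDS)
```

In practice the representative image would be chosen by an expert rather than by position in the list; the sketch only shows where that decision enters the process.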

Transfer learning
We employed four CNN architectures to predict the date of a papyrus image: DenseNet [28], VGG [23], EfficientNet [27] and ResNet [26]. These models consist of multiple convolutional layers followed by pooling layers and a classifier, which varies from a single dense layer to an MLP with several layers. In contrast with traditional CNNs, where the layers are connected sequentially, DenseNet connects each layer to every other layer in a feed-forward fashion. VGG uses convolutional layers with small filters (3x3) and max and spatial pooling layers, while ResNets are deep CNNs that are trained with residual learning. EfficientNets are a family of models that achieve state-of-the-art results and were created using neural architecture search and compound coefficient scaling. For each model we load a set of weights pre-trained on the 1000-class subset of ImageNet, a dataset of over 14 million images. Then, we remove the classifier, which is pre-trained on these classes, and replace it with a dense layer and softmax in order to obtain a probability distribution over the centuries that serve as labels in our task. The final model is fine-tuned end-to-end on the PaLIT dataset.
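To make the role of the replaced classifier concrete, the following sketch shows a dense layer followed by softmax turning a feature vector into a probability distribution over centuries. It is written in plain Python for readability and is not the PyTorch code used in the experiments; the feature vector, the weights and the four century labels are made-up values for illustration only.

```python
import math


def dense(features, weights, bias):
    """Affine layer: one logit per output class (here, per century)."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical setup: 3 features from the CNN backbone, 4 century labels.
CENTURIES = [1, 2, 3, 4]
features = [0.5, -1.0, 2.0]
weights = [
    [0.1, 0.0, 0.2],
    [0.3, -0.2, 0.5],
    [0.0, 0.1, -0.1],
    [-0.4, 0.2, 0.0],
]
bias = [0.0, 0.0, 0.0, 0.0]

probabilities = softmax(dense(features, weights, bias))
predicted_century = CENTURIES[probabilities.index(max(probabilities))]
```

During fine-tuning, the weights of this layer and of the backbone that produces `features` would be updated together, which is what end-to-end training refers to.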

Experiments
The experimental settings and the selected evaluation measures are discussed in this section, followed by a presentation of the results.

Experimental settings
For our experiments we removed the instances of centuries that occurred fewer than 10 times. The remaining 144 instances were split in a stratified way into 100 training, 22 validation and 22 test instances. Experiments were undertaken with Google Colaboratory, using an NVIDIA Tesla T4 GPU with 15GB of memory. The code was implemented in PyTorch and the pre-trained weights of the models were loaded from torchvision. We used the following versions of the CNN architectures: DenseNet-121, VGG-16, EfficientNetV2-L and ResNet-101. These models require a minimum input size of (224, 224). The images in the PaLIT dataset have an average width of 1616.33 and an average height of 1774.8 pixels. We experimented with three different settings: (1) Resize (224): the images were resized to the minimum required size; (2) Random crop (224): the images were cropped randomly to the minimum required size; (3) Resize (448): the images were resized to (448, 448). In addition, the images were converted to RGB and normalized with the mean and standard deviation of ImageNet. We trained the models using cross-entropy loss and SGD with momentum as the optimizer. Early stopping was used with a patience of three epochs. Each model was trained three times with different seed initialization.
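The filtering and splitting steps described above can be sketched as follows. This is an illustrative sketch rather than the code used for the experiments: the function names, the validation and test fractions (0.15 each, roughly matching 22 out of 144) and the toy label counts are assumptions.

```python
import random
from collections import Counter, defaultdict


def drop_rare_centuries(labels, min_count=10):
    """Return indices of samples whose century occurs at least min_count times."""
    counts = Counter(labels)
    return [i for i, century in enumerate(labels) if counts[century] >= min_count]


def stratified_split(indices, labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Split indices into train/val/test, keeping per-century proportions."""
    rng = random.Random(seed)
    by_century = defaultdict(list)
    for i in indices:
        by_century[labels[i]].append(i)
    train, val, test = [], [], []
    for century in sorted(by_century):
        members = by_century[century]
        rng.shuffle(members)
        n_val = round(len(members) * val_frac)
        n_test = round(len(members) * test_frac)
        val.extend(members[:n_val])
        test.extend(members[n_val:n_val + n_test])
        train.extend(members[n_val + n_test:])
    return train, val, test


# Toy century labels (hypothetical counts, not the PaLIT distribution).
labels = [2] * 40 + [3] * 30 + [4] * 12 + [7] * 2
kept = drop_rare_centuries(labels)
train, val, test = stratified_split(kept, labels)
```

Splitting per century guarantees that every retained century appears in the validation and test sets, which matters when the scores are macro-averaged over centuries.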

Evaluation measures
The performance of all models was measured using F1, which is the harmonic mean of Precision and Recall. Macro-averaging was used across the centuries, in order to give equal weight to the performance on all centuries and not only on the most frequent ones (e.g., 2nd and 3rd; see Figure 1). Classification evaluation measures do not capture how far a model's estimate (here, the century) lies from the ground truth. Hence, the Mean Absolute Error (MAE) was also employed, defined as the sum of the absolute differences between each predicted value and the respective ground truth over all test set samples, divided by the total number of samples.

Table 1 presents the experimental results. We observe that the best score overall is achieved by DenseNet using random cropping. For DenseNet and VGG, better scores are achieved using random cropping and 448-resizing than with 224-resizing. It is reasonable to infer that resizing to smaller sizes makes the task more difficult because important information is lost. Interestingly, however, this is not the case for EfficientNet and ResNet, which perform better when the images are resized to 224. EfficientNet, specifically, has competitive results in the standard resizing setting (224). Hence, we conclude that resizing to smaller sizes should not be disregarded as an option when fine-tuning pre-trained image classifiers.
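The two evaluation measures used here, macro-averaged F1 and MAE, can be sketched in plain Python as follows. This is a minimal illustration with made-up predictions, not the evaluation code behind Table 1; in particular, averaging over the centuries that occur in the ground truth is an assumption of this sketch.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-century F1 scores."""
    scores = []
    for century in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == century and p == century for t, p in pairs)
        fp = sum(t != century and p == century for t, p in pairs)
        fn = sum(t == century and p != century for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def mean_absolute_error(y_true, y_pred):
    """Average distance, in centuries, between prediction and ground truth."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ground-truth and predicted centuries for six test images.
y_true = [2, 2, 3, 3, 4, 4]
y_pred = [2, 3, 3, 3, 4, 2]

f1 = macro_f1(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```

In this toy example the last prediction misses by two centuries, which F1 counts as a single error but MAE penalises twice as heavily as a one-century miss.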

Century-level dating
The use of the century as the unit for dating the papyri was dictated by the image labelling in the training set of literary papyri (PaLIT). As broad a range as it may seem to the non-expert, most papyrologists assign specimens to dates spanning a whole century, sometimes even two centuries. Rarely, papyri are assigned to one half of a century, or a vague 'early' or 'late' part of a century. Even among documents, which frequently carry a precise date (often down to the day), those that do not are also assigned dates spanning whole centuries. Apart from constituting standard practice, there are also valid theoretical concerns about assigning narrow ranges when dating on palaeographical grounds [7].

Challenges and limitations
The lack of data, meaning objectively dated papyri that could be used for model training, is a major challenge that we faced. The lack of publicly available machine-actionable data is most probably due to papyri licensing issues. A thorough investigation of these issues has not been performed by the authors.

Class imbalance characterises the presented dataset. The distribution of papyri per century is very heterogeneous (Figure 1), to the point that some centuries have few to almost no samples, while others are well represented. For example, a single sample of literary papyri is included in PaLIT from the 1st century BCE and two from the 7th century CE. By contrast, the 2nd and 3rd centuries CE are represented by 87 and 71 samples respectively. This imbalance naturally affects the results and introduces a limitation, because it leads to poor performance when the models are called to date manuscripts from centuries with minimal representation.

Inaccurate ground truth is a final limitation. The dates that are assigned to the papyri by objective criteria [3] are often not entirely precise or accurate, but estimates with a margin of error of about 50 years. This means that a manuscript attributed to a certain century may have been written in the previous or the following century. These dates, however, serve as the ground truth for machine learning, and hence noise may be present.

Conclusion
This study presented experiments with transfer learning for the challenging task of dating Greek papyri. We used data from an online collection of objectively dated papyri, organised in a machine-actionable form. Experimental analysis showed that a DenseNet CNN, fine-tuned on randomly cropped patches of 224x224 pixels, achieves the best mean absolute error (1.17). Given that the ground truth estimates come with a margin of error of about 50 years, we find that room for improvement exists. Future work will attempt to establish better results by extending our dataset and by experimenting with augmentation.