Bones of contention: a double-blind study of experts’ ability to classify sheep and goat astragali from images

In zooarchaeology, animal bones are normally identified using comparative macro-morphological methods, which involve visual comparison of the bone with reference materials. However, recent work has oppugned the reliability of these methods. Although previous studies applying macro-morphological methods to identify sheep and goat bones have found low error rates, these results are based on small numbers of analysts and large numbers of different bone types and do not properly account for ambiguous ‘sheep/goat’ classifications. We present an extensive blind study of performance and reliability for binary macro-morphological species identification using just the astragalus. Each participant made independent comparative identifications on a random subset, including repeat presentations for consistency analysis. No sheep/goat category was offered. Instead, participants reported confidence scores on each sample. The participants also reported the reference materials used and indicated their regions of attention in each image. Findings indicate that neither the use of reference materials nor experience is a good predictor of accuracy, although more experienced analysts are found to be more consistent. Forcing binary classifications leads to a more transparent analysis but indicates lower performance scores than reported elsewhere, while corresponding confidence scores positively correlate with accuracy. Qualitative analysis of reported attention regions indicates that mistakes can occur when there is an overlap in the morphologies of the two species. We conclude that overreliance on reference materials impacts performance when the morphology of reference materials is not representative of the population variance, which is especially evident when the wider bone morphology is not adequately integrated into the classification decision.


Introduction
Many zooarchaeologically relevant blind studies have been published since Driver's (1992) discussion of the problems in zooarchaeological identifications.These blind studies have covered varying topics and they have varying aims, including (1) the comparison of laboratory-based techniques and comparative methods of identification (Greenlee and Dunnell 2010; Welker et al. 2015;Pilaar Birch et al. 2019;Prendergast et al. 2019), (2) the exploration of analysts' interpretations and methodological decision making when given the same assemblage (Gobalet 2001;Atici et al. 2013;Giovas et al. 2017), (3) the assessment of published criteria (Fernandez 2001;Zeder and Lapham 2010;Zeder and Pilaar 2010;Twiss et al. 2017), (4) the reproducibility of an assemblage (Nims and Butler 2017;Lau and Whitcher Kansa 2018), (5) the reproducibility of the identification of some features (Blumenschine et al. 1996;Lloveras et al. 2014), ( 6) the reproducibility of applying metric measurements (Lyman and VanPool 2009), and (7) the impact of fragmentation (Pickering et al. 2006;Domínguez-Rodrigo 2012;Morin et al. 2017).Together, these studies demonstrate inter-and intra-analyst variation in a wide range of zooarchaeological tasks, resulting from differences in the analysts' research backgrounds and innate abilities.The observed inter-analyst variation therefore leads to zooarchaeological identifications being subjective, even though this subjectivity is limited and controlled by a variety of reference materials 187 Page 2 of 35 including guides and manuals, both physical and virtual 3D reference specimens, and images and sketches of bones.Inter-analyst variation is commonly acknowledged and accepted in the zooarchaeological community, with for instance Twiss et al. (2017, p. 303) considering zooarchaeological identification to be an 'interpretative act ', and Wolverton (2013, p.390) maintaining that it is 'important to acknowledge that morphological identification is a subjective process'.In referring to subjectivity, we mean that the results of zooarchaeological research are dependent on and limited by the analysts' research histories and abilities, and it is the inter-analyst variance of classification accuracy that is of interest.Regarding intra-analyst variance, we refer to consistency.
When the subjective nature of the zooarchaeological identification process is combined with bones and species that are difficult to separate, one should expect a high level of inconsistency and inaccuracy.It is extremely well-known within the zooarchaeological community that sheep and goats are difficult to differentiate from their skeletal remains, to the extent that Noddle (1974, p. 195) called the problem 'legendary'.However, because sheep and goats are globally important domestic species, they played a key role in the early domestication process, and they are highly important to sedentary communities (Zeder 2008;Culley et al. 2021), it is important that sheep and goat bones are reliably identified by zooarchaeologists and that zooarchaeologists can trust each other's identifications of these elements.The importance of separating sheep and goat is also reflected in the fact that one of the first major uses of ZooMS (Zooarchaeology by Mass Spectrometry) was performed to separate sheep and goats (Buckley et al. 2010).
The astragalus was chosen as the focus of this study because they tend to survive relatively intact in archaeological contexts due to their relatively high density and low nutritional value (Haruda 2017;Pöllath et al. 2018), which means that they are rarely broken intentionally, although astragali are occasionally re-purposed as gaming pieces (Gilmour 1997;Koerper and Whitney-Desautels 1999;Holmgren 2004).Furthermore, several publications have used sheep and goat astragali as the subject of their research in showing that their methods can separate the two species (Davis 2016(Davis , 2017;;Haruda 2017; Salvagno and Albarella 2017; Salvagno 2020), whereas others have argued that descriptive morphological criteria can be defined to separate sheep and goat astragali (Boessneck et al. 1964;Boessneck 1969;Prummel and Frisch 1986;Fernandez 2001;Zeder and Lapham 2010).Sheep and goat astragali are therefore ideal for testing the accuracy and consistency of zooarchaeological analysts as the community is familiar with the problem and there are many studies that provide context for the present study.Likewise, existing osteometric and comparative methods provide a point of comparison for analyst performance in this blind study, as does the prior blind study of Zeder and Lapham (2010).Zeder and Lapham's (2010) blind study is considered at length in the next section for context.

Analyst performance in identifying sheep and goat astragali
Our study builds upon previous work by Zeder and Lapham (2010), who produced an in-depth analysis of their blind study, which tested a smaller sample of analysts on a broader set of sheep and goat bones than ours.To their credit, Zeder and Lapham (2010) not only recognised the importance of accuracy and consistency requirements in zooarchaeological macro-morphological identification, but also were the first to perform an extensive blind study, and the following reanalysis was only possible due to them making the raw data readily available.Because of the influence and importance of Zeder and Lapham's (2010) work in zooarchaeology, especially their argument that macro-morphological comparisons produce excellent classification accuracies, we use their study as a relevant backdrop to highlight prevalent issues in the zooarchaeological process as a whole, namely the influence of the hierarchically incompatible categories (sheep/ goat versus binary classifications) on the classification accuracy.Likewise, we use Zeder and Lapham's (2010) work as an example of how macro-morphological comparisons are dependent on analyst decisions, and therefore, descriptive methods are not always followed consistently.We do not claim that these issues were consciously ignored, only that they happen, nor do we aim to discredit their work.Moreover, we solely focus on the astragalus classifications.
In their assessment of the postcranial elements, Zeder and Lapham (2010, Table 5) found that using descriptive criteria to classify sheep and goat astragali results in a correct classification rate of 100% for goat and 97% for sheep astragali.This finding was then adopted by Pöllath et al. (2019, p.812) who stated that they 'consider it unlikely that our samples contained misidentified goat astragali' to bolster the readers' trust in the morphological assessment that their study was reliant upon.This is a risky stance to take considering that Zeder and Lapham (2010) reported 2.2% and 15.4% rates of sheep/goat identifications for goat and sheep astragali, respectively, which indicates a large amount of uncertainty especially around sheep astragalus identifications.Zooarchaeologists have grown accustomed to accepting sheep/goat almost as if it is yet another 'species' and such a classification is often considered a 'correct' classification (e.g.Davis 2017, p. 67), or ignored from the total number of classifications when computing accuracies as done by Zeder and Lapham (2010).As bones classified as sheep/goat present a considerable amount of uncertainty to the practitioner (Wolfhagen and Price 2017), their exclusion from the total number of classifications means that the accuracy rates for individual species may be inflated.We provide an example of this issue in Online Resource 1.Although the sheep/goat category may also be thought of as a 'reject' class, it should not be considered a pure reject class either, since such a role is already reserved for the 'unidentified' or 'indeterminate' category.Instead, sheep/goat classification reflects a higher hierarchical and taxonomical level of classification-and is therefore closer to a 'catch-all' than a 'reject' class-which is incompatible with binary classifications such as sheep and goat identification.Moreover, even if combining the two species increases sample size and produces a general picture of caprine exploitation, doing so also obfuscates the potential differences in the management strategies for the two species (Halstead et al. 2002;Buckley et al. 2010).
In addition to reporting the accuracy for each element, Zeder and Lapham (2010) report the classification rates for each individual criterion for each element as well.For the astragalus, these criteria involve the following: (1) the angle of the medial articular ridge in the dorsal aspect, (2) the shape of the distal articular surface in the lateral aspect, (3) the size and shape of the proximo-plantar projection in medial aspect, and (4) the prominence of the medial articular ridge in plantar aspect.In Zeder and Lapham's (2010) Table 3, in which each of the four criteria used in differentiating sheep and goat astragali were assessed individually, the error rates are reported as follows: (1) for the first criterion, the error rate is 0% for goats and 24.7% for sheep with 0% sheep/goat identifications for goat and 6.4% for sheep; (2) for the second criterion, the error rate is 4.6% for goats and 12.3% for sheep in addition to 4.4% sheep/goat for goats and 6.4% for sheep; (3) for the third criterion, the error rate for goats is 4.6% and 2.8% for sheep, which is again accompanied with 4.4% sheep/goat for goats and 6.5% for sheep; and (4) for the final criterion, the goat astragalus error rate is 7.1% and 7.4% for sheep, and there are 8.7% sheep/goat identifications for goat and 12.8% for sheep.However, these reported error rates were computed for intact specimens, and the error rates for archaeological specimens would be different as correct identifications are dependent on the survival of these four features of astragali.Moreover, although these error rates and percentages of sheep/goat identifications are seemingly low, they are not to be taken at face value as they derive from the authors' assessments of the bones even though they were aware of the ground-truth species, calling into question the validity of such a study, a point which Zeder and Lapham (2010) readily admit and attempt to correct by also reporting on results of a proper blind study.The same warning applies to the whole bone classification rates reported in Zeder and Lapham's (2010)  Considering the aforementioned issue regarding Zeder and Lapham's (2010) assessment methodology, the fact that all relevant features may not be present in archaeological samples and that the morphological variances of archaeological populations are likely to be different from modern samples, it is therefore not justifiable to use the identification rates from Zeder and Lapham's (2010) Tables 3 and 5 as expected accuracies in archaeological studies.Furthermore, taking the reported error rates at face value makes the assumption that all analysts would perform equally well when following Zeder and Lapham's (2010) criteria.
In fact, the blind study part of Zeder and Lapham's (2010) research demonstrates the exact opposite.In their blind study, six analysts, including the two authors, attempted the identification of ten specimens from each species for several elements, demonstrating inter-analyst variance as well as how the analysts' error rates differ when classifying sheep versus goat bones.According to Zeder and Lapham (2010), the astragali were analysed by all six participants, but from the tables included in the article and the associated Supplementary Table 4-in which the blind study results were detailed-it appears that only five of the participants analysed the astragali.Zeder and Lapham's (2010) Supplementary Table 4 is the basis for the following two paragraphs.
In the blind test, the analysts had two tasks: (1) classify individual features as sheep, goat, or sheep/goat and (2) classify the bones as sheep, goat, or sheep/goat.This then means that there are two ways in which the analyst accuracy can be computed: (1) majority-voting, in which each criterion has one vote and sheep/goat classifications are ignored or (2) taking only the final decision into account, again ignoring sheep/goat classifications.Treating the ambiguous sheep/ goat identifications as a reject category and excluding such classifications from the computations, the mean accuracy for all analysts was 94.74% (Table 1) for the astragalus identifications when only the final decisions are taken into account, and 89.36% (Table 2) when the species identifications were computed using majority-voting by criteria (Zeder and Lapham 2010).However, these accuracies are inflated by the exclusion of sheep/goat classifications from the total number of classifications.The authors (analysts 1 and 2) of the article did not fare any better than the less experienced analysts (analysts 4, 5, and 6 in Table 1) when just final decisions were taken into account and performed worse than the inexperienced analysts when considering each criterion as a vote (Table 2).
Zeder and Lapham's (2010) Supplementary Table 4 further shows that for two astragali, analyst 1 (one of the authors) scored all four criteria individually as either goat or sheep/goat but then decided that the bone was actually sheep.In contrast, analyst 2 (again, one of the authors) assigned one astragalus as a goat even though they thought that the bone adhered to sheep-like qualities on three of the four criteria.Furthermore, there were four other instances (all analysts) in which a bone was considered sheep/goat even though the majority of the criteria pointed to a more 187 Page 4 of 35 precise species assignation.In three of the four such cases, the decision to provide a more precise classification would have resulted in a wrong answer.
This behaviour by the analysts to disregard the morphological criteria highlights the problem with the methodology, namely that the individual criteria do not fully capture the species variation which leads to inconsistent application of the comparative methodology.It is also a distinct possibility that size and other undescribed morphological variables affect the analysts' decision-making.However, although the effect of undescribed morphological variables is unmeasurable and cannot be easily removed as a factor, bone size as a factor can be removed, and doing so could help in producing a more accurate assessment of the reliability of descriptive criteria.This can be most easily achieved by using high-definition images of bones, finding the outline of the bones, and cropping the image to the extent of the bone, after which all images can be scaled to a pre-determined size by padding the cropped image with a varying number of pixels.To ascertain that analysts are consistent and follow the written criteria, it would be similarly helpful to collect data on the analysts' areas of attention during the classification task.Again, photographs are helpful in achieving this goal as participants would only have to trace their attention areas over the regions of the photo.Moreover, the image dataset used in the present study is suitable for future deep learning applications, meaning that the blind study acts as a benchmark of human ability for any deep learning classifiers using these images.

Aims and objectives
This study has several aims: (1) to test the impact of different types of reference materials in a zooarchaeological task, (2) analyse the participants' spatial attention during classification, and (3) as it is argued that sheep/goat classifications are not only unnecessary, but that they also hide important information about the classifications this category is replaced by a two-step process in which the analyst is forced to make a classification and additionally report their confidence score for each classification.These self-reported  confidence scores force the analysts to consider the uncertainty in their classifications, and it is shown that higher confidence scores also correspond to higher accuracy overall.This study additionally provides a human benchmark for an archaeologically relevant image dataset for forthcoming deep-learning algorithms.The data was gathered in an online blind study, and it is analysed in a series of quantitative and qualitative analyses.

Image dataset
The specimens included in the dataset come from the National Museum Cardiff's Noddle collection, Sheffield University's zooarchaeological reference collection, and Historic England's collections stored at Fort Cumberland in Portsmouth.The list of specimens is provided in Online Resource 2. In total, 193 astragali (100 sheep, 93 goat) were photographed from six different views.However, nine specimens were not included in the blind study as there were problems with the quality of photographs of two specimens, and another seven had species information written on the bones themselves.The photography and post-processing steps are detailed in Online Resource 3 and an example of the six images (as shown to study participants) produced for each astragalus from this process is displayed in Figure 1.

Data collection
The blind study was hosted at www. sheep goat.co.uk and took place between 23 June 2020 and 31 December 2020.In total, 39 fully anonymous participants completed the entire study.Each analyst was first asked to consent to the test as required to fulfil the ethics requirements (ethics approval granted on 15 June 2020 by UCL Institute of Archaeology Ethics Committee, Reference number 2020.020), followed by a series of questions about their experience in zooarchaeology and the type of reference materials they would be using (detailed in Online Resource 4).After the pre-test survey, the participant was taken to a page detailing the instructions for the two main tasks of the study (see Online Resource 5).The first of these two tasks involved showing the analyst an astragalus from all six views and asking them to identify the specimen as either sheep or goat as well as give an estimate of the analyst's confidence in that classification (a sliding scale from 'Guess' to 'Absolutely certain', or numerically from 1 to 100 with a default value of 51).The participants were given the option to indicate whether they recognised the bone from reference collections because the participant may be from the same institution as the specimen-this feature was used only once and that classification is not taken into account in any analyses.The second task involved painting the areas that the analyst thought were informative for the species assignation on top of the set of images shown on the previous page.For Analyst 29, there was a glitch in their entries, and they provided 29 classifications and 31 drawings-only those drawings with corresponding classifications are included in the analysis of analyst attention.
After the test, the participants were allowed to leave feedback, and they were given the option to view how well they performed.The participants had the option to delete their entries throughout the test and even after it.The database entity relationship diagram is in Online Resource 6.The analyst answers and test images are in Online Resource 11.
Because asking participants to classify all 184 bones would have been too time-consuming, a random sampling strategy was chosen.This was implemented as part of the functioning of the website, which picked 20 astragali (ten of each species) for each analyst.As it was of interest to also gauge analyst consistency, another ten astragali (five of each species) were chosen at random from those initial 20.The participants were not told that their consistency was going to be measured, but they were told that the test would involve 30 specimens.Had the analysts been told about consistency testing, they may have tried to memorise their initial identification, which could have led to artificially improved consistency scores.It was programmatically ensured that none of the repeated bones was shown immediately after its first occurrence.Because of the randomised selection of samples for all participants, one of the 184 astragali was not shown to anyone.

Measuring expertise and defining expertise groups
Qualifications, years in the profession, and track record are all poor predictors of test performance (Burgman et al. 2011).Previous studies have found this to be the case in other species identification tasks such as fisheries observers' ability to identify sharks (Tillett et al. 2012) and great crested newt licence holders' ability to sort images of newts to species (Austen et al. 2018).Considering the fact that these proxies for experience are not reliable predictors of expertise, it was decided on an alternative approach to group analysts by expertise.In this study, the analysts were aggregated into expertise groups based on their answers to five questions, the exact wordings of which can be found in Online Resource 4. In general, these questions were designed to act together as a proxy for the participants' expertise in sheep and goat identification, not faunal analysis overall.
The participants were asked about the following: (1) their highest level of zooarchaeological qualification, (2) how many zooarchaeological assemblages they have worked on in the last five years, (3) how many hours per week on average they spent analysing zooarchaeological remains in the last 5 years, (4) whether they are specialised in the identification of land mammals, and (5) if their specialist experience involved sheep and goat separation.For the last two questions, the participants were given the option to select 'Not applicable'.In hindsight, this was a mistake, so these answers are interpreted in the analyses as 'No'.The first three questions were chosen because they were assumed to reflect overall experience and/or continued practice, both of which were assumed to correlate with a larger mental reference population, while continued practice is strongly correlated with skill (Ericsson and Lehmann 1996).These questions focused on the last five years (an arbitrary choice) because domain knowledge diminishes if it is not practiced (Endsley and Kiris 1995), and it could not be assumed that all participants were practicing zooarchaeologists.Furthermore, it is likely that most participants have at least some form of underlying training in mammalian zooarchaeology, although this was not asked specifically.The grouping of analysts was done with the combination of K-medoids and principal components analysis (PCA), as discussed next.

PCA and K-medoids
The answers to the above questions were pre-processed by centring them around zero and scaled to unit variance due to a varying number of options for each question, as recommended (Abdi and Williams 2010).This was done using Scikit-learn's StandardScaler method (Pedregosa et al. 2011) in Python 3.6 (Python Software Foundation 2016).Once the principal components were computed, those principal components that explain over 80% of the variance were used as input to K-medoids clustering, which aggregated analysts into groups of similar skill-level.K-medoids are more robust to noise and outliers than K-means clustering as the cluster centres in K-medoids are the most centrally located objects, whereas in K-means the cluster centre can be between objects (Zhang and Couloigner 2005).The number of analyst groups created was based on the combined evaluation of the sum of squared distances (the elbow method), Calinski-Harabasz index, and silhouette score.

Analysing the impact of reference materials
In addition to asking about the participants' use of reference texts (one of Boessneck 1969or Boessneck et al. 1964, Zeder and Lapham (2010), Prummel and Frisch (1986), Other, and None), the analysts were asked to provide information on whether they were planning on using physical specimens, photographs, sketches, or 3D models of either or both species as reference aides during the test.This information was important in measuring the impact of different reference material types on the analysts' performances.Although it is acknowledged that it would be better to have a control group and separate sessions for all participants so that the impact of different reference materials was more directly measurable, this was not possible within the research timeframe, and it was thought improbable that large enough cohort of participants could have been found for a such a multi-stage study.

Generalised linear mixed effects model
Instead of approaching the impact of reference materials in an unfeasible longitudinal study, generalised linear mixed effects models (GLMM) are used to model the relationship between the different types of reference materials and the analysts' classifications.GLMM is a statistical modelling method that merges the properties of linear mixed models (LMM) and generalised linear models (GLM; Bolker et al. 2009).Like GLM, GLMM involves the use of link functions to model non-normal data, whereas GLMM resembles LMM in its use of random effects (Bolker et al. 2009;Stroup 2013).Thus, GLMM is preferrable over GLM and LMM when the response variable is non-normal and one can expect variance beyond that explained by the measured variables (e.g.type of reference material).Additionally, GLMM can be effectively applied to unbalanced and zero-inflated experimental designs, which is the case here (Bolker et al. 2009;Moscatelli et al. 2012).GLMM is suitable for the present problem because the response variable is a binary variable encoding the correctness of a classification (1, 'Correct'; 0, 'Incorrect') and random variation can be modelled for the subjects (analysts) and the test items (bones).The present study additionally uses the species of the animal as a source of variance for each analyst.
The variables modelling random variance (i.e.heterogeneity between clusters) are collectively called random effects, and the fixed effects are those factors whose levels are determined by the experiment and are the main interest of the study (Bolker et al. 2009;Moscatelli et al. 2012).However, note that both random and fixed effects act as explanatory variables and their parameters are estimated through maximum likelihood estimation (Bolker et al. 2009).Models that implement random effects aim to incorporate variance between clusters-in the present study, the clusters are crossed so that each specimen may have been seen by multiple analysts, but each analyst's ability also varies by the species of the specimen.In practice, this model definition is achieved by allowing each analyst and specimen to have their own intercepts, and the slope for each analyst is defined by species.In other words, the random effects in the models discussed in the results section account for variance between analysts (be it due to skill or otherwise), between specimens (due to differences in morphology, quality of the photograph, or any other reason), and between species within the analyst (because the analyst may be biased towards one species).These variances have to be estimated as measuring them directly is very difficult.Fixed effects on the other hand are the factors of interest, namely the different reference material types and analyst expertise groups.
The following example model lends heavily from Barr et al. (2013) who provide a step-by-step explanation of defining a linear mixed model and which is here adapted for GLMM.In this example, the binary response variable Y si is modelled by a fixed effect X (e.g.species) and two random effects (subject, S, and item, I) with a slope of X varying within subject: As the response variable is a binary variable in the present study, a logit link function is used to relate the observations and the predictors.In the above equation, g −1 (•) is the inverse of logit(•) link function, and E Y si |S 0s , S 1s , I 0i is the expected value of the sth subject for the ith item when given by-subject random intercept S 0s , by-subject random slope S 1s , and by-item random intercept I 0i .0 is the overall intercept, 1 is the overall slope, and X i is the fixed effect predictor dummy variable for the ith item (e.g. a level of species such that 0, 'Goat'; 1, 'Sheep').The term e si models the residual error for the ith item and sth subject.
0 , 1 , S 0s , S 1s , and I 0i are the parameters estimated through maximum likelihood estimation.The distribution of variance parameters ( S 0s , S 1s , I 0i ) is usually assumed to be Gaussian with a mean of zero, although non-Gaussian random effects have been suggested (Lee andNelder 1996, 2006).The variance distributions for the example model are Here, 2 00 is the random intercept variance for subjects, 2 11 is the random slope variance for subjects, 2 00 2 11 is the intercept-slope covariance for subjects, and 2 00 is the random intercept variance for items.These parameters then tell us if the inclusion of S 0s , S 1s and I 0i random effects in the model is of importance, because if they are associated with zero or very low variance, it would indicate that there is little to no effect on the probability of correct answer between analysts and/or specimens.For a deeper explanation of GLMM, see Agresti (2002) or Stroup (2013), and for tutorials on implementing GLMM in R, see Baayen et al. (2008) and Bolker et al. (2009).

Finding the best-fit GLMM
The process of finding the best fit GLMM involves confirming that the inclusion of random effects is sensible.This can be done by fitting a baseline GLM and comparing a series of GLMMs that each have varying combinations of , 187 Page 8 of 35 random effect structures to the baseline GLM.At this stage, the GLMMs and the GLM do not have fixed effects.The comparison of the baseline model and GLMM with random effects is tested through likelihood ratio tests (LRT), and the best random effect structures are chosen based on the lowest Akaike information criterion (AIC), Bayesian information criterion (BIC), and negative log likelihood.The best random effects structure is not the final model, and it is extended by adding fixed effects in varying combinations.
The models fitted with fixed effects and the model with the best random effects structure are further compared using LRT to show that the fixed effects add useful information to the best random effects model.With many fixed effects, multicollinearity may become an issue, so models presenting such characteristics are ignored.

Software and packages
All models are created using the lme4 (Bates et al. 2015) package in R version 4.1.1.(R Core Team 2021).Laplace approximation is used to estimate the model parameters with the help of the BOBYQA (Bound Optimization BY Quadratic Approximation) optimizer.Regarding multicollinearity, models with variance inflation factor of above 5 are considered above the critical threshold, although higher and lower thresholds are also common (see Zuur et al. 2010;Thompson et al. 2017).Over-and underdispersion are unidentifiable for binary response values, so they are not evaluated for the models (Kain et al. 2015).

Auxiliary measurements
All classifications were timed by starting a timer as the page loaded and stopped as the participants progressed to the drawing task.As the classification page contained multiple tasks, the measured response speed is not a direct measure of response speed, and this metric therefore has a potentially low signal-to-noise ratio.For this reason, analysis of response speed is not included as part of the main text, but it is available in Online Resource 7.
In addition, the width of the browser window was measured because the study was designed to be shown in a browser window with a minimum width of 1586 pixels.Analysis regarding the impact of browser window size shows that those analysts with browser window widths less than 1586 pixels did not perform markedly worse than those with wider browser windows, and all analysts are therefore included in all analyses (Online Resource 8).

Results
All statistics are performed in Python 3.6 using the SciPy statistics package (Virtanen et al. 2020), apart from the GLMM analyses of reference materials for which the lme4 package in R has been established as one of the standard packages.

Analyst expertise groups
A summary table of analyst expertise is presented in Table 3.The majority (34) of the analysts have a master's or a doctorate degree with a zooarchaeological component, but only three of the participants perform zooarchaeological identifications full-time.The majority of the participants (32) consider themselves as land mammal specialists, and 25 participants also say they have experience of separating sheep and goat bones.Using the information in Table 4 as input to PCA, the first three principal components of the PCA explain approximately 82.95% of variance (Figure 2A).The first three components are then used as the inputs to K-medoids.Evaluating the K-medoids output through the Calinski-Harabasz index and the sum of squared distances (elbow method) results in the division of the analysts into four groups (Figure 2 B and C).Although the Silhouette coefficient results in two clusters (Figure 2D), the difference between the coefficients of two and four clusters is small.
On this basis, the analysts are divided into four groups, with each analyst's group membership indicated in Table 4.
The first (Group 1) of these four groups is interpreted as being formed of professionals (e.g.commercial zooarchaeologists) because it mainly includes full-time or near full-time workers who also have analysed many different assemblages in the past 5 years.Group 2 are relative novices at classifying land mammals and especially sheep and goat bones, while they also have not generally worked on many assemblages nor do they spend a lot of time acting as zooarchaeologists.Group 3 members are classed as postgraduates due to all of them holding master's qualifications.Group 3 members also all expressed having experience in separating sheep and goats, but they do not tend to spend as much time identifying bones as members in Group 1. Finally, participants in Group 4 are classed as doctorates because they all have PhD qualifications, and these analysts also have experience in separating sheep and goat bones.Group 4 analysts are similar to Group 3 analysts in that they do not spend a lot of time identifying bones.The expected order of performance for these groups from worst to best is as follows: Group 2, Group 3, Group 4, Group 1.The Group 1 analysts have been placed ahead of Group 4 analysts in our expectations on the basis that the professionals are more active in using their skills, which we assume to be more important for performance than the highest level of zooarchaeological qualification.We also expect the doctorates (Group 4) to outperform the postgraduates (Group 3) based on their higher level of education, while novices (Group 2) are expected to be the worst-performing group based on their lack of familiarity with sheep and goat differentiation and land mammal specialisation.

Analyst performance
The four most accurate analysts (three of which are in Group 4) managed to correctly classify 29 of the 30 sets of astragali, giving them an accuracy of 96.67% (Table 5).Their accuracy is far better than the average accuracy for all analysts, which is 81.15%.The least accurate participant is Analyst 68 in Group 1, whose accuracy (53.33%) is indistinguishable from chance.The mean accuracy for analysts in Group 4 (87.27%) is ten percentage points higher than for analysts in Groups 1 (77.08%) and 2 (77.13%), with Group 3 analysts (82.82%) also faring better than Group 1 and Group 2 analysts (Table 6).However, the median accuracy for analysts in Group 1 (80.00%) is better than for analysts in Group 2 (73.33%) and much closer to Group 3 median (81.67%).The median performance in Group 4 is 93.33%, which shows that the distribution of analyst accuracies is skewed left, just like it is for Group 1, whereas Group 2 is skewed right, and Group 3 is only negligibly skewed left.It is therefore inferred that Groups 1, 2, and 4 may contain outlying analysts.The boxplots in Figure 3 confirm this suspicion, and it is further noted that Group 3 has one outlying analyst who performed far better than most for that group of analysts.
The analysts generally performed better when identifying sheep (84.74% accuracy) than goat (77.56%) astragali (Table 7).The median for both species is slightly higher than the mean, with a median of 86.67% for sheep and 80.00% for goat.Group 4 analysts (mean: 93.33%, median: 100%) are the best at identifying sheep astragali,  but Group 3 had the highest mean accuracy for goat astragali (mean: 83.09%, median: 83.34%) while also being the only group of analysts that has similar accuracies for both species.However, Group 4 analysts had the highest median accuracy for goat astragali (mean: 81.21, median: 86.67%).
As the sheep and goat accuracies for all analysts were found to violate normal distribution in the Shapiro-Wilk test (Table 8), the Mann-Whitney U test was performed, and it was found that analysts are overall more accurate in classifying sheep than goat astragali (U = 965.0,p = 0.0393, N = 39).Cohen's d effect size (0.4892) indicates a moderate effect as Cohen's d values of 0.15, 0.36, and 0.65 represent the thresholds (based on empirical evidence) for small, medium, and large effect sizes, respectively (Lovakov and Agadullina 2021).
The analyst performances for overall sheep and goat accuracies were further subjected to statistical tests to verify the observed between-group patterns.As Group 4 scores did not satisfy the assumption of normality in the Shapiro-Wilk test (Table 8), the group-wise scores were transformed using Box-Cox power transformation, but the group-wise scores did not satisfy the equality of variance assumption in Levene's test (W = 8.2307, p = 0.0003).Using other power transformations did not help with transforming the data to satisfy the assumption of normality, and therefore, Kruskal-Wallis H test (Kruskal and Wallis   (Dunn 1964;Dinno 2015).This analysis demonstrates that there is no evidence in favour of any one group of analysts having higher overall accuracy than at least one other group (df = 3, H = 7.0575, p = 0.0701).Likewise, the test did not provide evidence for group-wise differences in goat astragalus accuracy (df = 3, H = 4.4727, p = 0.2147), but there does appear to be a group-wise difference in the analysts' ability to classify sheep astragali (df = 3, H = 9.7208, p = 0.0211).Thus, performing Dunn's test to discover the group-wise differences in sheep astragalus accuracies, it is shown that there is a significant (when α < 0.05) difference between Group 2 and Group 4 analysts (Group 2 and Group 4: p = 0.0269), but not between any other two groups (Group 1 and Group 4: p = 0.1876; Group 3 and Group 4: p = 0.1296; for all other group-wise comparisons: p = 1).
Although the only between-groups difference identified in statistical testing was found between the analyst groups 2 and 4 regarding sheep astragalus accuracy, Figure 3 additionally implies that Group 4 outperforms the other groups in overall and sheep accuracy, but not in goat accuracy.It may be that the number of participants is not large enough to produce statistical significance in a rank-based statistical testing, even though graphical evaluation demonstrated a clear difference between groups.

Analyst consistency
Overall analyst consistency is 86.41%, which indicates a good but not excellent consistency among the participants (Table 9).Groups 1 and 4 are more consistent than Groups 2 and 3, suggesting that more experienced analysts are more consistent.Considering that the Shapiro-Wilk test for normality (Table 10) shows that Group 4 consistency scores do not satisfy the assumption of normality-and Box-Cox power transformation was unsuccessful-Kruskall-Wallis H test was again applied to the untransformed values.The outcome of this test does not support the hypothesis that any group of analysts is more consistent than any other group, however (df = 3, H = 4.4148, p = 0.22).
The Bland-Altman plot (Figure 4) shows the difference in accuracies between the test and re-test samples (N = 10) against the mean accuracy of these same samples and therefore demonstrates the relationship between accuracy and consistency for all groups-Group 4 is the most consistent and the most accurate, whereas Group 1 analysts are more consistent but not more accurate than Group 2 and Group 3 analysts.Thus, although statistical testing did not demonstrate significant differences in consistency across the analyst groups, the graphical analysis suggests that the more experienced analysts are more consistent.It may be that group size is again a factor in the statistical tests, the same as for the analyses of group-wise differences in accuracy.

Reference materials
The analysts were allowed to use any reference materials they thought would be helpful.Indeed, when prompted about the usefulness of reference materials after the main part of the study, only three analysts mentioned that reference materials were not useful, whereas 30 analysts found their reference materials having been useful and six analysts did not answer the question.The participants' reference material usage is itemised in Table 11, which shows that none of the analysts relied solely on goat astragalus in any of the media, but seven analysts considered it appropriate to rely on a physical sheep astragalus.Two of those seven analysts used reference images of sheep as well.The counts of different reference material types are shown in Table 12.
In addition, there appear to be group-wise trends as well, which are summarised in Tables 13 and 14.In short, Group 1 analysts use physical reference specimens more frequently than analysts in other groups, Group 2 and Group 3 analysts prefer using images and sketches, and Group 4 analysts refrain from using reference materials apart from reference texts.Regarding the analysts' use of reference texts, Groups 1, 2, and 4 prefer Zeder and Lapham's (2010) publication and Group 3 prefer Boessneck et al. (1964) or Boessneck (1969).Furthermore, Figure 5 shows that Group 3 analysts  are more likely to use many different reference sources, whereas Group 4 analysts tend to use just one.

GLMM for reference materials
In the first instance, four different random effects structures are compared to a baseline GLM through LRT (Table 15).LRT is a suitable test for mixed-effects models when the model only contains random effects (Bolker et al. 2009).
When applying LR testing to fixed effects models, the model parameters should be estimated through maximum likelihood rather than restricted maximum likelihood, and the sample size should be large (Bolker et al. 2009).As there is no clear definition of what constitutes a large sample size (we consider our sample size of 1168 classification to be adequate, although these come from 39 participants) and because Bolker et al. (2009, p.132) 'would recommend against using the LR test for fixed effects unless the total sample size and numbers of blocks are very large' for using LRT, we also report AIC scores.It was found that the best combination of random effects is one where species ('Goat' = 0, 'Sheep' = 1) acts as random slope within the analyst random effect, while both analyst and specimen are the random intercepts.Species is also used as a fixed effect in this configuration.The best random effects model is Null 4 to which the different types of reference materials are added as fixed effects.
As reference specimens, images, and sketches were not utilised uniformly (see Table 11), their levels were re-formatted to binary ('No' = 0, 'Yes' = 1) levels to indicate whether the analyst used a given reference material.
Regarding reference text usage, Prummel and Frisch (1986) was not used by anyone and therefore reference text fixed effect has four levels ('None' = 0, 'Boessneck' = 1, 'Other' = 2, 'Zeder and Lapham' = 3).3D model usage was too infrequent to be of use, so it is not taken into consideration.The response variable is a binary variable corresponding to whether the classification was correct or not.
Using each reference material type individually as the fixed effect, it was found that only reference specimens (Ref 3 model) and reference images (Ref 4 model) added more information to the Null 4 model (Table 16) in LRT and AIC.Furthermore, it was found that adding all reference materials (Ref 5 model) and analyst grouping (Ref 6 model) as fixed effects is also justified given the LRT and AIC results in Table 16.Incorporating interactions of the different reference materials as well as the analyst groupings, however, resulted in an overfit model, and this model is therefore not reported here.Although the likelihood ratio test did not show a significant improvement for Ref   18).The formal definition of Ref 6 model is provided in Online Resource 9.
Ref 6 model produces log odds estimates of a correct answer given the species of the specimen, analyst expertise group membership, and the combination of reference materials used by the analyst, conditional on specimen and analyst random effects.Observing the coefficients for this model (Table 19), it is notable that none of the individual reference materials has a significant and positive effect on the probability of a correct answer as is the case for the Ref 5 model.The only variable with a significant effect on the log odds of a correct answer is whether the analyst is in Group 4.

Self-reported confidence
Self-reported confidence was measured on the scale from 1 to 100 with a default value of 51.Out of the total of 1168 classifications made by the participants, the default confidence value was not changed in 225 instances.Thus, this section involves the evaluation of both corrected (N=943) and uncorrected (N=1168) confidence values in separate tests, where the corrected subset of classifications simply means the removal of classifications with a confidence value of 51.Furthermore, although the self-reported confidence values represent interval data, it can be argued that they should be represented through Likert-like scale due to obtaining the values using a sliding scale.Thus, the classifications are placed into four confidence bins: For both corrected and uncorrected sets of classifications, χ 2 -tests were performed to see if there are statistically significant differences in the expected and observed frequencies of correct and incorrect answers for the four different confidence categories.The results of this test are shown in Table 20 -the χ 2 -tests were statistically significant in both cases when α < 0.05, which means that the number of correct and incorrect classifications are dependent on the confidence category.Although the association between the confidence category and the classification correctness is not particularly strong according to Cramér's V measure, the relationship between confidence categories and correct answers is quite clear in Figure 6, especially for corrected confidence scores.This plot shows that as analysts are more confident about their classification, the ratio of incorrect to correct answers is reduced, although the error rate is still 10.46% in the highest confidence category.Thus, replacing sheep/goat identification with strict species identifications accompanied by self-assessed confidence scores for each bone could increase the overall accuracy if one only takes those classifications with the highest confidence category into account.Reduced error rate would have the effect of reducing noise in subsequent analyses and therefore enable statistically more powerful zooarchaeological research.
Applying different thresholds to the confidence scores (x-axis) and plotting them against the mean accuracy for the classifications with a confidence score above that threshold (y-axis), it can be shown (Figure 7A and C) that the selfreported confidence threshold that maximises the classification accuracy in the present task is 96, when the mean accuracy reaches 95.11%.However, the mean accuracy begins to plateau around the 85 mark, when the mean accuracy is 93.33%.In addition, there is a large difference in the number of classifications that are taken into account at these two different thresholds-when the threshold is set at 96 (inclusive), we are only including 15.16% of the classifications in the corrected set of classifications, whereas at 85 (inclusive) level, we include 31.81% of classifications.To include at least 50% of the classifications, the threshold has to be set to 75 (inclusive), when the mean classification accuracy is 88.77%.To include at least 75% of the classifications, the threshold must be lowered to 57 (inclusive), when the mean classification accuracy is 86.08%.
In Figure 7B and D, the mean accuracy for classifications below the given confidence threshold is presented for the uncorrected and corrected datasets, respectively.These figures demonstrate that lower confidence scores are less informative than higher confidence scores because of the large confidence intervals-there are far fewer classifications with low confidence scores, but classifications with low confidence scores cannot immediately be said to be misclassified.Instead, Figure 7B and D mainly reflect the methodology for obtaining the self-reported confidence scores in that the lower end of the sliding scale was labelled 'Guess', so the expected accuracy for a classification with a confidence score of 1 is 50%.The trend in these figures is that the analysts' accuracy is slightly above the expected accuracy, but the expected accuracy is within the 95% confidence interval of the empirical data.

Between-group differences in confidence scores
In addition to the correct and incorrect answers being dependent on the confidence scores for all answers, there may exist between-group differences in this relationship.The bar plots in Figure 8 show that the error rates for analysts in Groups 3 and 4 are much lower for the 'Very high' confidence category than in the other three confidence categories.Although the overall pattern is similar for Groups 1 and 2, the effect is much smaller.χ 2 -tests were not suitable  for finding out the relationship within analyst groups as the expected values were less than five in more than 20% of the cells for all but Group 2, which is a commonly cited minimum threshold for using χ 2 -tests for independence (Bewick et al. 2004).Thus, it was opted to use the continuous confidence scores rather than the categorical values in the comparison of analyst expertise groups.
These continuous confidence values did not satisfy the assumption of normal distribution (Table 21), and therefore, Kruskal-Wallis H test (Table 22) followed by Dunn's test (Table 23) with Bonferroni correction was performed.These tests demonstrate that Group 2 confidence scores are significantly different from the other groups when α = 0.05, and Group 3 confidence scores are additionally significantly different from Group 4 confidence scores when only taking the corrected set of confidence values into account.This result implies that less experienced analysts are less confident about their decisions, as would be expected.However, taking the analysts' mean confidence scores and plotting them with boxplots (Figure 9), it can be shown that Group 3 analysts' mean confidences are not much different from Group 4 analysts' mean confidences.This difference in the conclusions drawn from the boxplot and Dunn's test can be explained by the fact that raw confidence scores of individual answers were used in Dunn's test, whereas the boxplots display analysts' mean confidences.Thus, while analysts in Group 3 are not, on average, less confident than analysts in Group 4, it may be that Group 3 participants were shown astragali that were, by chance, morphologically more ambiguous than those shown to Group 4 analysts.

Between-species differences in confidence scores
Next, it is aimed to answer the question of which species the analysts are more confident about classifying.This part of the analysis was devised due to seven analysts opting to use a physical sheep astragalus as a reference, and it was of interest to find out whether the analysts were more confident about classifying sheep astragali.This hypothesis was also informed by the general observation that suitable goat astragali were harder to find for inclusion in the present study and that the literature on goat bone development is not as extensive as for sheep, which leads us to believe that there may be a bias in the zooarchaeological community towards a higher confidence in identifying sheep astragali.First, it was tested whether the confidence scores for sheep and goat bones followed normal distributions.This test was performed using Shapiro-Wilk test, which shows that the confidence scores for sheep (uncorrected: W = 0.9429, p = 3.39e-14; corrected: W = 0.8918, p = 1.05e-17) and goat (uncorrected: W = 0.951, p = 5.25e-13; corrected: W = 0.9062, p = 1.81e-16) violates the normality assumption, and thus, Mann-Whitney U test was performed (Table 24).This test supports the hypothesis that analysts are less confident about their decision when classifying goats, although the effect size (Cohen's d) suggests that the species of the bone has only a small effect on confidence (see Lovakov and Agadullina 2021).Furthermore, even though the median and mean confidence scores for sheep tend to be higher than for goat, within-group Mann-Whitney U tests did not produce any statistically significant results for uncorrected or corrected sets (Table 25).

Specimen difficulty
Next, we want to identify difficult and easy specimens, which is achieved by computing the difficulty index (also known as the facility index) for each specimen.Difficulty index simply measures the proportion of answers that were correct for a given specimen, meaning that the lower the index, the harder the specimen.This index is first used in comparing the levels of difficulty between the two species because we observed that species impacted the analysts' confidence scores and accuracies.We then explore the between-group differences in the difficulty index as some groups may have been shown easier specimens by chance.Finally, we perform a correlation test to see if the difficulty correlates with analysts' mean confidence.However, note that specimens with low numbers of classifications may over-or under-estimate the difficulty of the specimen.

Between-species comparison
The difficulty index is shown for each specimen in Figure 10, demonstrating that a large proportion of specimens is very easy, and only a small number are very difficult.The specimens with red labels in the x-axis represent sheep, and the blue labels represent goat astragali.Difficulty index violates the assumption of normality in the Shapiro-Wilk test for both species (Sheep: N = 99, W = 0.741, p = 6.33e-12;Goat: N = 84, W = 0.8161, p = 7.6e-09).Thus, the Mann-Whitney U test was performed, showing that sheep are significantly (at α < 0.05) easier than goats for the analysts to classify (N sheep = 99, N goat = 84, Mean sheep = 0.8344, Mean goat = 0.7546, U = 4896, p = 0.03).The Cohen's d effect size is small to moderate (d = 0.3093), and Figure 11 indicates that the harder sheep astragali are outliers.The spread of difficulties is also shown as a kernel density estimate plot in Figure 12 for all specimens.

Between-group comparison
As it was observed that there are differences in analyst performances and confidence scores between expertise groups (albeit not statistically significant), it is of interest to verify that the difficulty of the specimens seen by analysts in different groups does not vary significantly between groups.After computing the mean difficulty of the specimens seen by each analyst, the within-group difficulties passed the Shapiro-Wilk tests for normality for all groups (Group 1: W = 0.9358, p = 0.5698; Group 2: W = 0.9150, p = 0.2473; Group 3: W = 0.9702, p = 0.8999; Group 4: W = 0.9506, p = 0.6515) and Levene's test for equality of variance (F = 0.5304, p = 0.6644), and thus, one-way ANOVA is performed.The result of the ANOVA test is that the specimen difficulties between analyst expertise groups are not significantly different (F = 0.2410, p = 0.8672), and this conclusion is supported by the kernel density estimate plot for group-wise item difficulty (Figure 13).We can therefore be confident in the assumption that the random selection of specimens for each analyst did not significantly affect any of the analyses.

Correlation tests
The specimen difficulty is expected to correlate with the analysts' mean confidence.Because we are using the corrected set of confidence scores here, the tests are performed on a reduced number of specimens.As the specimen mean confidences violate the assumption of normality in Shapiro-Wilk test (W = 0.9731, p = 0.0016, N = 179) as does the specimen difficulty index (W = 0.7390, p = 1.71e-16,N = 179), Spearman's rank correlation test is performed, which demonstrates a positive correlation (r = 0.3778, p = 1.85e-07,N = 179).Thus, as specimens have a higher difficulty index (i.e. they are easier), the analysts are more confident about their classification, although the correlation is not very strong.
The strength of the correlation between mean confidence and item difficulty is not explained by the number of responses for a given specimen, as shown by Spearman's correlation test between the number of responses and difficulty index (r = −0.0186,p = 0.803, N = 183).This is further demonstrated in Figure 14, where the mean difficulty index for specimens (triangles in boxplots) is plotted against the number of responses by analysts (x-axis).This figure shows how the mean (triangles) difficulty index remains relatively stable between 0.7 and 0.9 for all items with 13 or fewer responses while the number of specimens ranges from five to  25 in the same interval.The three specimens with 14 or more responses appear to have been very easy for the analysts, as two of them have a difficulty index of 1, and the specimen with 17 responses has a difficulty index of just below 0.9. Figure 14 additionally shows this phenomenon for both species, demonstrating how the mean difficulty index (triangles) is more often higher for sheep when specimens are grouped by the number of times they were attempted.

Qualitative analysis of the analysts' areas of attention
As part of the study, the participants were asked to paint on top of the images of the bones the features that they used in their classifications.This resulted in hundreds of drawings for both species, and these drawings are used here in a qualitative manner with respect to the zooarchaeological descriptions typically used for differentiating the sheep and goat astragali.First, the overall patterns are explored, and then we turn our attention to the easiest and hardest specimens of each species.

Overall analyst attention
The analysts' drawings of the regions of the bones pertinent to the classification (Figure 15A) follow very closely the regions included in the descriptive criteria defined by Zeder and Lapham (2010) and Boessneck (1969;Boessneck et al. 1964)-the medial articular ridge in dorsal and plantar views, the distal articular surface in lateral view, and the proximo-plantar projection in medial view are the most often drawn on regions for both species.The number of drawings for each view is very similar for the two species (see counts in Figure 15A), so there are no large differences in the execution of the drawing task between species.
To obtain a more nuanced understanding of the analysts' behaviour, the drawings of correct (Figure 15B) and incorrect (Figure 15C) answers are compared.As would be  expected, when the classifications are correct, the analysts' drawings and the descriptive criteria are in synchrony-the features drawn on sheep and goat astragali correspond to the descriptive criteria associated with the respective species.However, when the analysts are incorrect in their classifications, the drawn areas of goat specimens now correspond to sheep-like qualities in descriptive criteria, and the sheep drawings correspond to goat-like qualities.Thus, the analysts make their decisions by following descriptive criteria very closely, but as the criteria do not fully encapsulate the variances of the morphologies of the two species, the analysts make mistakes.This pattern is easiest to discern in the drawings for the dorsal and medial views of the correct and incorrect answers.In the dorsal view, the medial articular ridge is described as being generally oblique for goats and horizontal for sheep, but the opposite is true for incorrect classifications.Concerning the medial view, the proximo-plantar projection is usually described as being more rounded for sheep and pointed for goats, but again, the opposite is true for the drawings of the incorrect answers.

The easiest and the hardest specimens
To underline this overall pattern, we turn to the easiest and the hardest specimens as examples of cases when the morphological shape of the bone does not fit within the variation covered by descriptive criteria.The easiest specimens are chosen on the basis of which bones had the highest difficulty index and the most classification attempts, whereas the hardest specimens are chosen based on the lowest difficulty index and highest number of classification attempts.The easiest sheep is Portsmouth 3113 L (correctly classified   in all 14 classifications, Figure 16A) and the easiest goat is Cardiff 57 L (correctly classified in all 15 attempts, Figure 16B).The hardest sheep is Portsmouth 3567 L (incorrectly classified in all three attempts, Figure 16C) and the hardest goat specimen is Cardiff 77 R (incorrectly classified in all seven cases, Figure 16D).L and R indicate the side of the animal the specimen comes from.The red areas in these images correspond to all drawings made by the analysts.
To begin with, the proximo-plantar projection and medial articular ridge in medial view are very similar for the easiest sheep (Figure 16A) and the hardest goat specimens (Figure 16D), showing the overlapping morphological variances of the two species.Furthermore, the medial articular ridge for the hardest goat specimen, Cardiff 77 R, is nearly horizontal with respect to the longest axis of the bone in dorsal view, which would normally be associated with sheep (Boessneck 1969;Zeder and Lapham 2010).The angle of the medial articular ridge for Cardiff 77 R specimen is even more horizontal than for the easiest sheep, Portsmouth 3113 L. It is therefore quite understandable that this bone was problematic for analysts if they followed the instructions in descriptive criteria.However, Cardiff 77 R has morphological qualities that may make it vary from the expected.For instance, Cardiff 77 R has a general curvature that can be seen in the dorsal view, which may result in the medial articular ridge becoming more horizontal as well as reduce its prominence in the medial view.In the plantar view of this specimen, the plantar articular surface is also distinctly shifted towards the lateral side, which may be a further symptom of the bending of the astragalus.This particular specimen can therefore be likened to a spring that  is pressed unevenly such that as one side is compressed, the opposite side bulges outwards.Even though the morphology of Cardiff 77 R does not conform to the expected morphology, it is not possible to say definitively that this bone is abnormal because we simply do not know the population variance.In other words, although Cardiff 77 R is just one sample, bones with morphologies beyond the descriptive criteria and other comparative methods may be more common than currently thought given the vastness of the archaeological record.Current comparative methods (especially those depending on written and/or drawn descriptions) therefore underestimate the importance of morphological variation as the variation is reduced to single points in a multidimensional space.
The hardest sheep and the easiest goat specimens, too, have shared morphological qualities.The hardest sheep-Portsmouth 3567 L (Figure 16C)-has an oblique medial articular ridge in the dorsal view that is quite prominent in the medial view, and the proximo-plantar projection in the medial view is very pointed, all of which are features normally associated with goat-like morphologies (Boessneck 1969;Zeder and Lapham 2010) and are found in Cardiff 57 L (Figure 16B).Yet, the distal articular surface in the lateral view of Portsmouth 3567 L is not tear-drop shaped (a feature typically considered goat-like) as it runs across the entire lateral face of the bone, which is as expected from a sheep (Zeder and Lapham 2010).
To summarise, the medial articular ridge and the pointiness of the proximo-plantar projection are the most likely sources of much of the confusion for human analysts due to the analysts' dependence on these features in both dorsal and medial views.The medial articular ridge obliqueness in dorsal view is not very different between the hard and the easy specimens (apart from Cardiff 77 R, for which we gave a possible explanation), whereas its prominence in medial view is very similar between the easy goat and the hard sheep as well as between the easy sheep and the hard goat.Likewise, the proximo-plantar projection is pointed for hard sheep and easy goat, but less so for hard goat and easy sheep.These examples therefore underline the argument that morphological descriptions do not apply to individuals whose morphologies vary from the morphology of those samples that were used in defining the original descriptive criteria.Thus, even though these features are good individual features to separate a large proportion of sheep and goat astragali, the overlapping morphological variance is such that zooarchaeology as a discipline could benefit from a holistic approach to classifying bones to species.In other words, instead of typifying bone morphologies to simple rules and specific features, the whole bone morphology should be considered for more reliable classifications in the future, especially as the features of a given bone are dependent on all other features of that same bone as well as the articulating bones.Bones should therefore be thought of as continuous geometries, not as a series of discrete features.Of course, the observations made here on a limited number of bones require further quantitative testing, but this is beyond the scope of this article.

Discussion
The results of the blind study are complex, but we have achieved our three main aims.First, we have tested the impact of different types of reference materials in the classification of sheep and goat astragali.Second, we have analysed the participants' spatial attention.Third, we have shown that selfreported confidence scores can be used in place of the ambiguous sheep/goat category.The additional benefit of our study is that it can be used as a benchmark for an archaeologically relevant image dataset for forthcoming deep-learning models.
However, we acknowledge that the results may be affected by the method of acquiring the participant data since zooarchaeologists do not normally classify bones only from photographs, and instead, zooarchaeologists aim to inspect the bones carefully in person.The images used also did not incorporate relevant size information to help the analysts differentiate the species, which may affect the results.The removal of size information was necessary because it is not usually part of the described morphological criteria, although osteometric analyses demonstrate size differences between sheep and goat astragali (Davis 2017;Haruda 2017;Salvagno and Albarella 2017).
Although it has previously been found that there is little agreement in the identifications of pottery and lithic artefacts when they are based on digital photographs and when the identifications are performed in a laboratory (Heilen and Altschul 2013), Heilen and Altschul's (2013) study failed to account for the fact that two different analysts were involved in the classification of the materials, so their observed discrepancy may simply reflect the two analysts' capabilities.We therefore argue that even though the analysts' performances could be better if performed in person, there is no firm evidence that using digital images is the cause of lower classification accuracies.In fact, only one participant (Analyst 104) raised any concerns about the photographs, saying that a couple of the photographs were 'blurred at the proximal end', 'truncated', and 'speckled or very light/white and it was not clear where the surfaces were'.Analyst 104 also mentioned that lighting and angle could be adjusted when analysing bones in person.Finally, we would further counter the argument that using images rather than physical specimens somehow impacted the results by arguing that identifications of bones from images are not infrequent, and they typically occur when an analyst is faced with a difficult morphology, and they want to consult with colleagues elsewhere in the world.

Analyst performance
The blind study demonstrates that the consistency of human experts is good, but their overall accuracy (81.15%) is lower than the accuracy reported (89.36-94.74%,see Tables 1 and 2) by Zeder and Lapham (2010).This difference may be explained by the differing methodologies and the number of participants.Furthermore, it was observed that human error rates for sheep (15.26%) are lower than for goat astragali (22.44%) and that the analysts were also more confident about classifying sheep astragali.The noted differences regarding accuracy and confidence are reflected in how difficult the bones of the two species were for the analysts overall, with sheep bones generally being easier to identify.This observed difference between species means that using comparative methods of identifying archaeological remains of sheep and goat astragali are likely to lead to a bias towards sheep astragalus identifications.Our study therefore raises the question of how large of an impact this bias has on sheep-to-goat ratios.We discuss the impact of confidence and especially in relation to sheep-to-goat ratio in more detail in a separate section below.

Group-level differences
Although statistical tests were unable to find clear differences between the analyst groups apart from sheep astragalus accuracy between Group 2 and Group 4, we consider these expertise groups to be valid and display different levels of abilities based on the performance boxplots in Figure 3 and consistency plot in Figure 4.However, the expertise groups themselves were not found to match expectations; prior to the analysis of the group performances, we set the expectation that Group 1 analysts (the professionals) would outperform other groups due to zooarchaeological identification tasks being part of their daily routine, followed by Group 4 (doctorates), Group 3 (postgraduates), and finally Group 2 (novices).However, our study shows that Group 4 analysts were the most accurate and consistent group even though they tended to have worked on fewer number of assemblages and spent less time on identification tasks per week than Group 1 participants.Moreover, Group 4 outperformed other groups despite their lack of reliance on reference materials.Taking all of the evidence regarding Group 4 (i.e.high accuracy, high consistency, high confidence, infrequent use of reference materials, PhD level education, and relatively few hours spent in identification tasks) into consideration, we can be confident in stating that they were the true experts.Thus, the group performance results are at odds with our expected order.One possible explanation for the discrepancy between our expectations and the results may be that our study design does not truly capture expertise in sheep and goat astragalus differentiation.We do not think this is the case, since the deviation of Group 1 performance from the expectation (i.e.observed rank 3 vs expected rank 1) is enough to explain the overall deviation from the expected order of the group performances.We can also clearly see that novices (Group 2) are worse than postgraduates (Group 3) who are worse than doctorates (Group 4), demonstrating a progression in performances as the analysts become more highly qualified and better acquainted with sheep and goat separation.Note, however, that the highest level of qualification is not the defining feature of the expertise groupings and other questions also play a role in the creation of the groups.If the study design had been flawed, it is more likely that the rank order of the analyst groups would be closer to random and we would not have been able to place the analysts into such clearly defined groups as the analysts would have been assigned to groups randomly.
Instead, we argue that the hours worked in zooarchaeological identification tasks and how many assemblages the analysts had worked on in the last 5 years do not have a large impact on analysts' ability to separate sheep and goat astragali.We argue this to be the case on the basis that Group 1 analysts were not more accurate and only somewhat more consistent than analysts in Group 2 (relative novices) and those in Group 3 (postgraduates).This result further conforms to the more general assessment of the impact of experience on performance by Ericsson and Lehmann (1996), who state that an increased amount of knowledge gained through experience in a domain does not always lead to superior performance compared to the less-experienced individuals.In other words, sheep and goat astragalus classification is not necessarily a task where analyst expertise can be measured via time spent on zooarchaeological tasks or the variety of tasks.Instead, it is argued that the latent ability to identify shapes, continuous self-improvement, or even a teaching role is more likely to lead to better performance, but this cannot be deduced from the collected data and requires further research.

Inference on reference materials
Surprisingly, we did not find any evidence of reference materials being helpful in the classification task.In the GLMM analysis, using Boessneck (1969;Boessneck et al. 1964) as a reference text was the only type of reference material that was found to have a positive coefficient, while all the other reference materials had a negative coefficient.This result means that not using any type of reference material results in a higher log odds of correctly identifying the species than when relying on reference materials, apart from when using Boessneck (1969;Boessneck et al. 1964), and even then the difference is not statistically significant.The only important factor in having a higher log odds of a correct answer is membership in Group 4. This result may seem flawed to many, since most of us believe that having reference materials is helpful (e.g.Bochenski 2008)-this was certainly the authors' belief as well.However, we argue that reference materials are likely to be most useful when a bone morphology is encountered for the first time or when the morphology is very uncommon, as the analyst has not yet formed a mental frame of reference for that morphology.As this mental frame of reference is formed over time, the usefulness of reference materials wanes.This is evident in the fact that most zooarchaeologists are capable of identifying a large variety of bones accurately without relying on any reference materials, but they are quick to ask for help when encountering an unusual specimen.Likewise, reference materials are often used when trying to differentiate between two very similar morphologies, but as noted, our results do not support such a usage.As Group 4 analysts were much less likely to rely on reference materials, they had the highest accuracy, they were the most consistent, and they were the most confident, we put forward the argument that these analysts have built a mental frame of reference for the morphological differences between sheep and goat astragali, and we again reiterate our argument that they are likely to be the true experts at this task.This result cannot be explained by the selection of bones that Group 4 analysts analysed, since the distribution of specimen difficulties was not found to differ between groups.
Additionally, the result that reference materials were not found to increase the log odds of correct answers could reflect the fact that the population variances are not captured by single reference items (be it a 3D model, physical specimen, image, or sketch).While publications and manuals can draw conclusions from a large number of specimens, they tend to simplify the variation to single points.Physical reference collections are similarly limited as they often only include a handful of specimens for each species and typically contain samples from a limited subset of species.The usefulness of reference materials is therefore likely limited by their incapability to contain variance.Even descriptive criteria may not encapsulate the entire variation as they tend to be gross generalisations that are based on specimens that can derive from many geographic regions or worse, from very specific populations.Indeed, it has been argued that geographic and temporal variation affect the morphology of at least sheep astragali (Haruda et al. 2019;Pöllath et al. 2019).Thus, when an analyst is exceedingly reliant on reference materials, their accuracy may be lowered if they are unable to match the underlying population variances of the reference sample and the test sample, particularly when only parts of the bone morphology are used such as when the study specimen is fragmented.In a forthcoming article, we argue that the mismatch between the population variances and the variance encapsulated in reference materials can be reduced by using deep learning convolutional neural networks as they can take advantage of the entire bone morphology in a classification task, whereas comparative identification is reliant on defined, discrete features.

Impact of confidence
As expected, we found that confidence correlates with one's ability to classify sheep and goat astragali and that human analyst confidence correlates with the difficulty of the specimen, which is again as expected.It was also found that by setting a threshold to confidence scores, it becomes possible to limit the evaluation of an assemblage to only those specimens that the analysts are the most confident about, which increased the average accuracy.This observation is especially true for analysts in Groups 3 and 4, whose accuracies improved from 82.82% and 87.27% for all answers to 95.24% and 94.74% for answers in the 'Very high' confidence category (using a threshold value of 75), respectively.When the confidence threshold is set to 85 for all groups, the mean accuracy is 93.33%, whereas the mean accuracy for all answers is 81.15%.
As analysts are more confident about classifying sheep than goats, limiting the answers to the most confident responses has the downside in that the sheep-to-goat ratio shifts in favour of sheep, regardless of whether one analyses the true ratio, the ratio for correct answers, or the ratio for all answers.We have demonstrated this effect in detail in Online resource 10.Importantly, as low-confidence answers would be likely classified as sheep/goat in zooarchaeological analyses, it is probable that this same effect is present in many published studies, and sheep are therefore more than likely overrepresented in sheep-to-goat ratios that do not take ambiguous sheep/goat classifications into account.

Experience and confidence
In Driver's (1992) view, more experienced analysts are less willing to differentiate between morphologically similar species and therefore are presumably more likely to place bones in the sheep/goat category.However, our study found that analysts in Groups 1 and 4 (the more experienced analysts) have higher confidence scores than analysts in Groups 2 and 3.Because there is no prior information about what confidence level analysts used when they place samples in the 'sheep/goat' class, analysts could find the adoption of self-reported confidence scores beneficial in that it allows the communication of the rejection criteria.For example, the analyst can simply state that they included specimens that they were 75% confident to be correct and excluded those specimens with lower confidence from further analysis.Using self-reported confidence scores therefore informs peers about the analyst's certainty for each specimen, which in turn allows the reviewer to identify potential blind spots 187 Page 32 of 35 in the analysis.The analysts themselves could use this data for self-reflection to find out areas of zooarchaeological identification that they need to improve upon should they keep track of their confidence scores over a long period.Furthermore, one additional avenue for taking advantage of confidence values could be to use them as prior probabilities following Wolfhagen and Price's (2017) methodology.However, unlike Wolfhagen and Price (2017, p.626), who mention that prior probabilities 'express beliefs about the proportion of goats or sheep in an assemblage', confidence scores could be used to express beliefs about the specimen being a goat or a sheep.The difference in these definitions is larger than it seems: instead of using one prior for the entire study sample, confidence scores allow a unique prior for all bones.
In the future, similar studies that incorporate fragmented bones should be conducted because fragmentation has an impact on analyst accuracy (Pickering et al. 2006), which is likely to also have an effect on confidence.Furthermore, as it was found that analysts are more confident and better at classifying sheep astragali, it may be that there is a wider issue of reference materials being biased towards one species, and it would be beneficial to look further into this issue.Overreliance on reference materials that are biased or depict limited variance would lead to systematic errors, which may erroneously lead to inflated or deflated confidence, depending on the population prevalence of the features used for classifying the bones.These errors would then be compounded in regional studies or meta-analyses.

Specimen difficulty
Perhaps unexpectedly, the majority of the astragali were easy for the analysts, while a small number of them were very difficult.The qualitative analysis of the analysts' attention demonstrates how the analysts follow the descriptive criteria very faithfully.It is put forward here that the morphological variation of the difficult bones is at the fringes or beyond the morphological variation expressed by reference materials or indeed the analysts' own mental frame of reference.For example, the obliqueness of the medial articular ridge and the pointiness of the proximo-plantar projection are very similar for difficult goat and easy sheep specimens and vice versa, which obviously leads to a false classification as these features are primary features used in differentiating the two species (Boessneck 1969;Zeder and Lapham 2010).Thus, it is important to involve morphologically more varied references, but doing so is not easy nor always possible due to the availability of specimens or space, let alone the fact that the analyst would have to make several comparisons for each possible species.Instead, computational models (such as deep learning convolutional neural networks) could be used in the future.

Conclusion
Our study has shown that human analysts can be very accurate at classifying sheep and goat astragali from photographs, but on average, their classification error rate is around one in five specimens.We did not find any evidence that using reference materials leads to a higher probability of a correct answer, and we argue that the reason is that morphological variation is not captured well by any single reference material type.Furthermore, analysts were found to be more confident and accurate when classifying sheep astragali, suggesting that either the analysts or the reference materials may be biased towards the identification of sheep.It is likely that this bias is also present in published sheep-to-goat ratios.This research has further shown that self-reported confidence scores can be used in place of the ambiguous sheep/goat class, which has several benefits to the researchers.For example, confidence scores can inform peers about potential blind spots in the analysis, and analysts can use self-reported confidence scores to identify their own strengths and weaknesses if they keep track of the confidence scores.
Finally, analysts' focus maps demonstrate that human experts are somewhat inflexible about which features they use to classify the specimens.This behaviour leads to the misclassification of specimens that do not have very specific features that fall within the morphological variance insinuated by reference materials.As written and drawn comparative materials often have discrete boundaries and zooarchaeologists are trained to follow these guides, any bone that varies from the expected rigid shape has a higher probability of being misidentified.In some cases, it may be enough that one feature of a bone is deemed to have a morphology of another species that all other features are disregarded.Therefore, we conclude that more flexible classification methods are needed for increased classification accuracy.Such a method should take a holistic approach to the species morphology and be able to encapsulate the full variance of the bones.We suggest that deep learning convolutional neural networks are able to achieve this, and their usefulness will be shown in a forthcoming paper.Deep learning convolutional neural network models also have the benefit of being portable as they only occupy digital space and can be utilised over the internet.
Page 33 of 35 187 their help with the zooarchaeology reference collection at the University of Sheffield.Finally, we thank those 39 anonymous participants who completed the blind study, and without whose volunteering, this study would not have been possible.The presented work is an adaptation of the work presented in the first author's doctoral thesis.

Fig. 1
Fig. 1 Example images for a goat astragalus

Fig. 2 A
Fig. 2 A PCA scree plot showing cumulative explained variance.B Sum of squared distance score for K-medoids.C Calinski-Harabasz index score for K-medoids.D Silhouette score for K-medoids.The

Fig. 3
Fig.3Boxplots of accuracy, precision, recall, and F 1 -score for the analysts.The first row shows the metrics for both species, whereas the second and third rows show the metrics for the two species separately.
6 model over Ref 5 (χ 2 = 7.2753, Df = 3, Pr(> χ 2 ) = 0.0636), the focus is nonetheless on Ref 6 model because we are interested in the Group level effect and the AIC score is lower.Any question of whether Simpson's Paradox has occurred is negated by the fact that none of the Ref 5 model coefficients for reference materials is significant inTable 17, nor are they hugely different from Ref 6 coefficients.In other words, whether the data is split by Group (as in Ref 6 model) or all

Fig. 5
Fig. 5 A bar plot of analysts' number of used reference resources.Best viewed in colour, available online

Fig. 6 Fig. 7 A
Fig. 6 Correct and incorrect classifications by confidence categories for the corrected and uncorrected sets.Best viewed in colour, available online

Fig. 8
Fig. 8 Correct and incorrect classifications by corrected confidence categories for all analyst groups.Best viewed in colour, available online

Fig. 9
Fig. 9 Boxplot of mean confidences for the analysts.The green triangle indicates mean value, while the orange line reflects median.Circles are outlying analysts

Fig. 10
Fig. 10 Difficulty index for each specimen.The specimen names are constructed as follows: Portsmouth refers to Historic England collection, Sheffield to Sheffield University, and Cardiff to National Museum Wales.The number in the name refers to their catalogue ID

Fig. 11
Fig. 11 Boxplot of the difficulty index.Outliers are circles, the green triangle is the within-group mean, and the orange line is the withingroup median

Fig. 14 A 3 187
Fig. 14 A Relationship between the number of responses and the difficulty index.The coloured triangles are means associated with the boxplots, and the diamonds are the outlier specimens in terms of difficulty.Note how the means (triangles) do not vary significantly

Fig. 16
Fig. 16 The easiest (top row) and hardest (bottom row) sheep and goat specimens overlain with their average analyst focus maps -the brightness of the focus areas indicates frequency.The sheep are on the left and the goats are on the right side.R in the specimen name refers to the right side and L to the left side of the animal.A The easiest sheep astra-

Table 3
Analysts' expertise summarisedCount of analysts by levels of expertise

Table 4
Itemised answers and analyst group membershipAnalyst expertise groups

Table 5
Analyst performance scores for both species and separately.Note that precision, recall, and F 1 -scores in both species sections are computed as the means of the two species

Table 7
Means and medians for sheep and goat astragali by expertise groups Means and medians by expertise groups for each species

Table 8
Shapiro-Wilk tests for normality for analysts' accuracy overall and for sheep and goats separately.Only Group 4 analysts have a non-undertaken on the untransformed accuracies to compare if at least one group's ranks dominate at least one other group.Kruskal-Wallis H test is followed by Dunn's test with Bonferroni correction to understand exactly which groups are different, as recommended as the post-hoc test

Table 9
Mean consistency based on the ten repeated specimens

Table 10
Shapiro-Wilk test for normality for analysts' consistency.

Table 12
Counts of analysts for each level of all reference materials

Table 14
Summary of groups' use of reference texts

Table 15
Random effects structure selection.Each null model was compared to the baseline GLM.The best random effects structure is marked by * *Page 19 of 35 187

Table 16
Likelihood ratio tests showing that additional reference materials contribute to the model.Null 4 is the best fit null model from Table15.Fixed effects models that are better fit than Null 4 are marked by *

Table 17
Maximum likelihood estimates for Ref 5 model.The effects with significant coefficients are marked by *

Table 18
Multicollinearity measures for fixed effects included in Ref 6 model

Table 19
Maximum likelihood estimates for Ref 6 model.The effects with significant coefficients are marked by *

Table 21
Shapiro-Wilk's test for normality for analysts' mean confidence scores for all groups Shapiro-Wilk's test for normality for analyst mean confidence (group-wise)

Table 23
Dunn's test for continuous confidence scores across analyst groups.Values are Bonferroni corrected p-values at α = 0.05 Dunn's test for continuous confidence scores across analyst groups

Table 24
Mann-Whitney U test for confidences between sheep and goats

Table 25
Mann-Whitney U test for sheep and goat confidence scores for each analyst group.Bonferroni correction is used to compute the adjusted p-value Mann-Whitney U test for sheep and goat confidences, group-wise