Sampling of trait words
Here we follow the definition of a trait as a temporally stable characteristic. Traits in our study include personality traits as well as other temporally stable characteristics that people spontaneously infer from faces, such as age, gender, race, socioeconomic status, and social evaluative qualities (Supplementary Fig. 1a; e.g., "young", "female", "white", "educated", "trustworthy"). By contrast, we excluded state attributions such as "smiling" or "thinking". Words that can describe both trait and state variables were not excluded; for example, we included "happy" but disambiguated its usage as a trait in our instructions to participants ("A person who is usually cheerful").
Our goal was to representatively sample a comprehensive list of trait words that are used to describe people from their faces. We derived a final set of 100 traits (Supplementary Table 1) through a series of combinations and filters (detailed below; see also our preregistration at https://osf.io/6p542). These 100 traits were further verified to be representative of the words that people freely generate to describe trait judgments of our face stimuli (Fig. 2a-b).
To derive the final set of trait words, we first gathered from multiple sources12–15,19,21,25–31,33,38,39 an inclusive list of 482 adjectives and 6 nouns covering all major categories of trait judgments of faces: demographic characteristics, physical appearance, social evaluative qualities, personality, and emotional traits. Many of the 482 adjectives were synonyms or antonyms. To avoid redundancy while conserving semantic variability, we sampled these adjectives according to three criteria: semantic similarity (detailed below), clarity of meaning (rated by an independent set of MTurk participants; detailed below), and frequency of usage (detailed below). Among words with similar meanings, clarity was the second selection criterion (the word with the highest clarity was retained); among words with similar meanings and clarity, usage frequency was the third selection criterion (the word with the highest usage frequency was retained).
To quantify the semantic similarity among these 482 adjectives, we represented each adjective as a vector of 300 computationally extracted semantic features, using a word-embedding and text-classification neural network provided in the FastText library40; this network had been trained on Common Crawl data of 600 billion words to predict the identity of a word given its context. We then applied hierarchical agglomerative clustering (HAC) to the word vectors, based on their cosine distances, to visualize their semantic similarities. To quantify clarity of meaning, we obtained clarity ratings from an independent set of participants tested via MTurk (N = 31, 17 males; age M = 36, SD = 10). To quantify usage frequency, we obtained the average monthly Google search frequency for the bigram of each adjective (i.e., the adjective followed by the word "person") using the keyword research tool Keywords Everywhere (https://keywordseverywhere.com/).
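As an illustration, the embedding and clustering steps can be sketched as follows in Python, assuming the pretrained English FastText model (cc.en.300.bin, trained on Common Crawl) has been downloaded; the short word list here is illustrative, not the full 482-adjective set.

```python
# Minimal sketch: embed trait adjectives with FastText and cluster by cosine distance.
import fasttext
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

ft = fasttext.load_model("cc.en.300.bin")          # 300-d pretrained embeddings

adjectives = ["trustworthy", "honest", "reliable", "aggressive", "hostile"]
vectors = np.vstack([ft.get_word_vector(w) for w in adjectives])

# Hierarchical agglomerative clustering on cosine distances (average linkage).
Z = linkage(vectors, method="average", metric="cosine")
dendrogram(Z, labels=adjectives)                   # visualize semantic similarity
plt.show()
```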
The 94 adjectives representatively sampled with the above procedures, together with the additional 6 nouns, constituted our final set of 100 trait words. To verify the representativeness of these 100 trait words, we compared their distribution along the 300 computationally extracted semantic dimensions with that of 973 words that participants freely generated to describe their spontaneous impressions of the same faces (see Supplementary Fig. 1a and Methods below; Fig. 2a-b).
To ensure that the dimensionality of the meanings of our words did not itself limit the four factors we discovered in our study, we derived a similarity matrix among the 100 words using the FastText vectors of their meanings in the specific one-sentence definitions given to participants in the experiments (Supplementary Table 1; basic stop-words such as "a", "about", "by", "can", "often", and "others" were removed from the one-sentence definitions before computing the vector representations), and then conducted factor analysis on this similarity matrix. Parallel analysis, the optimal coordinates index, and Kaiser's rule all suggested 13 dimensions; Velicer's MAP suggested 14 dimensions; and empirical BIC, which penalizes model complexity, suggested 5 dimensions. We used EFA to extract 5 and 13 factors with the same method as for the trait ratings (factors were extracted with the minimal residual method and rotated with oblimin to allow for potential factor correlations; 13 factors explained the same common variance as 14 factors, 70%, and 5 factors explained 60%). None of the dimensions obtained bore any resemblance to our four reported dimensions, arguing that the mere semantic similarity structure of our 100 trait words did not constrain the derivation of the four factors that we report.
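A sketch of this control analysis is below, under two stated assumptions: `load_definitions()` is a hypothetical helper returning the 100 stop-word-stripped definitions from Supplementary Table 1, and the cosine-similarity matrix is treated as a correlation-like matrix for factor extraction.

```python
# Minimal sketch: factor-analyze the semantic similarity structure of the definitions.
import numpy as np
import fasttext
from sklearn.metrics.pairwise import cosine_similarity
from factor_analyzer import FactorAnalyzer

ft = fasttext.load_model("cc.en.300.bin")
definitions = load_definitions()   # hypothetical helper: 100 one-sentence definitions

vecs = np.vstack([ft.get_sentence_vector(d) for d in definitions])
sim = cosine_similarity(vecs)      # 100 x 100 semantic similarity matrix

# Minimal residual ("minres") extraction with oblimin rotation, as in the text.
fa = FactorAnalyzer(n_factors=5, method="minres", rotation="oblimin",
                    is_corr_matrix=True)
fa.fit(sim)
print(fa.loadings_)                # inspect factor loadings of the 100 words
```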
Sampling of face images
Our goal was to derive a representative set of neutral, frontal, white faces of high quality (clear, direct gaze, frontal, unoccluded, and high resolution) that are diverse in facial structure. We aimed to maximize variability in facial structure while controlling for factors that our present project did not intend to investigate, such as race, expression, viewing angle, gaze, and background. We first combined 909 high-resolution photographs of male and female faces from three publicly available face databases: the Oslo Face Database43, the Chicago Face Database42, and the Face Research Lab London Set41. We then excluded faces that were not front-facing, lacked direct gaze, or were occluded by glasses or other adornments. We further restricted the set to images of Caucasian adults with neutral expressions. This yielded 426 faces across the three databases.
To reduce the size of the stimulus set while conserving variability in facial structure, we sampled from the 426 faces using maximum variation sampling. For each image, the face region was first detected and cropped using the dlib library44 and then represented as a vector of 128 computationally extracted facial features for face recognition, using a neural network provided within the dlib library that had been trained to identify individuals with very high accuracy across millions of faces varying in appearance and race44. Next, we sampled 50 female faces and 50 male faces that respectively maximized the sum of the Euclidean distances between their face vectors. Specifically, a face image was first selected at random from the female or male sampling set; further images of the same gender were then selected one at a time, such that each newly selected image was farthest in Euclidean distance from the previously selected images. We repeated this procedure with 10,000 different initializations and selected the sample with the maximum sum of Euclidean distances. We repeated the whole sampling procedure 50 times to ensure convergence of the final sample. All 100 images in the final sample were high-resolution color images with a uniform grey background, cropped to a standard size with the eyes aligned at the same height across images. See preregistration at https://osf.io/6p542.
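The greedy sampling step can be sketched as follows, assuming `X` is an (n_faces, 128) NumPy array of dlib descriptors for one gender; we interpret "farthest from the previously selected images" as maximizing the summed Euclidean distance to the already-selected set, which matches the final selection criterion.

```python
# Minimal sketch of maximum-variation (farthest-point) sampling with random restarts.
import numpy as np
from scipy.spatial.distance import cdist

def farthest_point_sample(X, k, rng):
    """Greedily pick k faces, each maximizing summed distance to those already picked."""
    chosen = [int(rng.integers(len(X)))]                  # random seed face
    while len(chosen) < k:
        remaining = np.setdiff1d(np.arange(len(X)), chosen)
        sums = cdist(X[remaining], X[chosen]).sum(axis=1)
        chosen.append(int(remaining[np.argmax(sums)]))
    return np.array(chosen)

def total_spread(X, idx):
    """Sum of pairwise Euclidean distances within a sample."""
    return cdist(X[idx], X[idx]).sum() / 2

rng = np.random.default_rng(0)
samples = [farthest_point_sample(X, 50, rng) for _ in range(10_000)]  # 10,000 restarts
best = max(samples, key=lambda idx: total_spread(X, idx))
```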
To verify the representativeness of our selected 100 face images, we again performed UMAP analysis46 to compare the distribution of our selected faces with (a) N = 632 neutral, frontal, white faces from a broader set of databases47–49 (Fig. 2c-d) and (b) N = 5,376 white faces "in the wild"57,58 that varied in angle, gaze, facial expression, lighting, and background (Supplementary Fig. 1b), using the 128 computationally extracted facial identity dimensions44 as well as 30 traditional facial metric dimensions42 (Supplementary Fig. 1c).
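A minimal sketch of the UMAP comparison, assuming `X_selected` and `X_reference` are arrays of the 128-dimensional descriptors for the sampled and reference faces, respectively:

```python
# Project selected and reference face descriptors into 2-D with UMAP.
import numpy as np
import umap

X_all = np.vstack([X_selected, X_reference])
embedding = umap.UMAP(n_components=2, metric="euclidean",
                      random_state=0).fit_transform(X_all)
sel = embedding[:len(X_selected)]      # 2-D points for the 100 sampled faces
ref = embedding[len(X_selected):]      # 2-D points for the reference set
```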
Freely generated trait words
To verify that our selected 100 trait words were indeed representative of the trait judgments people spontaneously make from faces, we collected an independent dataset from participants who freely generated words about the person that came to mind upon viewing each face. As preregistered, 30 participants were recruited via MTurk (see preregistration at http://bit.ly/osfpre4); departing from the preregistration, we included participants of any race rather than only Caucasian participants (27 participants were white, 3 were black).
Participants viewed the 100 face images one by one, each for 1 second, and typed in the words (preferably single-word adjectives) that came to mind about the person whose face they had just seen. Participants could type in as many as ten words per face and were encouraged to type in at least four (the number of words entered per trial, i.e., by one participant for one face, ranged from 0 words [8 trials] to 10 words [190 trials], with a mean of 5 words). There was no time limit; participants clicked "confirm" to move on to the next trial once they had entered all the words they wished to enter for the current trial. All data can be accessed at https://osf.io/4mvyt/.
Study 1 Participants
All studies in this report were approved by the Institutional Review Board of the California Institute of Technology, and informed consent was obtained from all participants. We predetermined our sample size for Study 1 based on a recent study that investigated the point of stability for trait judgments of faces59: across 24 traits, a stable average rating could be obtained with a sample of 18 to 42 participants (ratings were elicited on a 7-point rating scale, the acceptable corridor of stability was +/- 0.5, and the confidence level was 95%). Based on these findings, we preregistered a sample size of 60 participants per trait for Study 1 (at https://osf.io/6p542).
Participants were recruited via MTurk (N = 1,500; 800 males; age M = 38 years, SD = 11; median educational attainment: "some post-high-school, no bachelor's degree"). All participants were required to be native English speakers located in the U.S., aged 18 years or older, with normal or corrected-to-normal vision, an educational attainment of high school or above, and a good MTurk participation history (approval rating ≥ 95%).
We also collected data on whether our participants were currently being treated for psychiatric or neurological illness. The majority of our participants (79.7%) were not. All dimensional analyses reported in the main text on the full sample were repeated on this 79.7% subsample, and the results corroborated all findings from the full dataset: Tucker indices of factor congruence for the four dimensions = 1.00, 1.00, 0.99, 0.99.
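Tucker's coefficient of congruence between two factor-loading vectors is their normalized inner product; a minimal sketch of the computation used to compare loadings from the full sample with those from the subsample:

```python
# Tucker's coefficient of factor congruence for two loading vectors x and y.
import numpy as np

def tucker_congruence(x, y):
    """phi = sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    return np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))
```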
Study 1 Procedures
All experiments in Study 1 were completed online via MTurk. Considering the large amount of time it would take for a participant to complete ratings for all 100 traits and 100 faces, we divided the experiment into 25 modules: the 100 traits were randomly shuffled once and divided into 25 modules, each consisting of 4 traits. Each participant completed one module.
To encourage participants to use the full range of the rating scale, we briefly showed all faces (in five arrays of 20 faces each) at the beginning of a module, so that participants had a sense of the range of faces they were going to rate. In each module, participants rated all faces on each of the four traits (in random order) in the first four blocks; in the last (fifth) block they rerated all faces on the trait they had been assigned in the first block, thus providing sparse within-subject consistency data.
At the beginning of each block, participants were instructed on the trait they were asked to evaluate and were provided with a one-sentence definition of the trait (Supplementary Table 1). Participants viewed the faces one by one in random order (each for 1 second) and rated each face on a trait using a 7-point rating scale (by pressing the number keys on the computer keyboard). Participants could enter their ratings as soon as the face appeared or within four seconds after the face disappeared. The orientation of the rating scale in each block was randomized across participants. At the end of the experiment, participants completed a brief questionnaire on demographic information. See preregistration at https://osf.io/6p542.
Measures of reliability in Study 1
Data were first processed following three preregistered exclusion criteria (see preregistration at https://osf.io/6p542): of the full sample with a preregistered size of N = 1,500 participants and L = 750,000 ratings, n = 48 participants and l = 27,491 ratings were excluded from further analysis. Each of the 100 traits was rated twice for all faces by nonoverlapping subsets of participants (ca. n = 15 per trait). As preregistered, we applied linear mixed-effects modeling to assess within-subject consistency; this approach adjusts for the non-independence of repeated individual ratings by incorporating both fixed effects (constant across participants) and random effects (varying across participants). Ratings from every participant for every face collected at the second time were regressed on those collected at the first time (ca. l = 1,445 pairs of ratings per trait), controlling for the random effect of participants.
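A minimal sketch of this model using statsmodels, assuming a long-format DataFrame `df` with hypothetical columns rating_t1, rating_t2, and participant; we model the participant effect as a random intercept (the text specifies only "the random effect of participants"):

```python
# Within-subject consistency: second-pass ratings regressed on first-pass ratings,
# with a random intercept per participant for non-independence of repeated ratings.
import statsmodels.formula.api as smf

model = smf.mixedlm("rating_t2 ~ rating_t1", data=df, groups=df["participant"])
result = model.fit()
print(result.summary())
```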
As preregistered, we assessed the between-subject consensus for each trait with intraclass correlation coefficients (ICC(2,k)), using ratings of every face by every participant (ca. n = 58 participants and l = 5,780 ratings per trait). A high intraclass correlation coefficient indicates that the total variance in the ratings is explained mainly by variance across faces rather than across participants. We observed excellent between-subject consensus (ICCs greater than 0.75) for 93 of the 100 traits, and good between-subject consensus (ICCs greater than 0.60) for the remaining 7 traits (Fig. 3).
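A minimal sketch of the consensus measure using the pingouin library, assuming a long-format DataFrame `df` with hypothetical columns face, participant, and rating for a single trait:

```python
# Between-subject consensus: ICC(2,k), i.e., average-rater consensus across faces.
import pingouin as pg

icc = pg.intraclass_corr(data=df, targets="face", raters="participant",
                         ratings="rating")
icc2k = icc.loc[icc["Type"] == "ICC2k", "ICC"].item()
```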
Determination of the optimal number of factors
As recommended50,51,60,61, we used five methods to determine the optimal number of factors to retain in EFA. No single method is regarded as the best; solutions are considered most reliable when multiple methods agree. Parallel analysis retains factors that are unlikely to be due to chance, by comparing the eigenvalues of the observed data matrix with those of multiple randomly generated data matrices matched in size to the observed data matrix. Prior studies showed that parallel analysis estimates the number of factors accurately and consistently across different conditions (e.g., the distributional properties of the data)60,61. Cattell's scree test retains factors to the left of the point from which the plotted ordered eigenvalues can be approximated by a straight line (i.e., it retains factors "above the elbow"). The optimal coordinates index provides a non-graphical solution to Cattell's scree test based on linear extrapolation. Empirical Bayesian information criterion (eBIC) retains the number of factors that minimizes the overall discrepancy between the population and model-predicted covariance matrices while penalizing model complexity. Velicer's minimum average partial (MAP) test is "most appropriate when component analysis is employed as an alternative to, or a first-stage solution for, factor analysis"62; we included it in the present study because of its popularity. MAP retains the number of components that minimizes the average squared partial correlation after those components are partialed out.
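As an illustration of the first of these methods, parallel analysis can be sketched as follows, assuming `data` is an (n_observations × n_traits) rating matrix; the mean random eigenvalue is used as the retention threshold here, though a percentile criterion is also common.

```python
# Minimal sketch of parallel analysis for the number of factors to retain.
import numpy as np

def parallel_analysis(data, n_iter=100, seed=0):
    """Retain factors whose eigenvalues exceed those of size-matched random data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs_eig = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    rand_eig = np.zeros((n_iter, p))
    for i in range(n_iter):
        rand = rng.standard_normal((n, p))
        rand_eig[i] = np.linalg.eigvalsh(np.corrcoef(rand, rowvar=False))[::-1]
    threshold = rand_eig.mean(axis=0)      # mean eigenvalues of random data
    return int(np.sum(obs_eig > threshold))
```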
Labeling of Dimensions
Dimensionality reduction methods do not provide labels for the factors discovered, which must instead be interpreted by the investigators. We note that our third and fourth dimensions describe stereotypes related to gender (femininity-stereotypes) and age (youth-stereotypes) commonly reported in the literature11. In fact, essentially all trait judgments based on faces, and therefore all of our dimensions, are a reflection of people’s stereotypes of some sort, since in our study nothing else is known about the people whose faces are used as stimuli, and therefore no ground truth is provided. We therefore omitted “-stereotypes” in our labeling of all dimensions, since it implicitly applies to all of them.
Confirmatory analyses with artificial neural networks and cross-validation
To compare different theoretical models and test for potential nonlinearity in our data, we employed an artificial neural network approach, in particular autoencoders63, with cross-validation. The aim of an autoencoder is to learn a lower-dimensional representation of the data. We constructed different autoencoders corresponding to the different models we wished to test (the existing 2D and 3D frameworks13,37 and the 4D framework from our EFA). We trained these autoencoders on half of the data (for each trait, 50% of the individuals were randomly selected and their ratings were used to compute new aggregated ratings per face per trait) and tested them on the held-out other half. We used the Adam optimization algorithm52 and a mean squared error loss function, with a batch size of 32 and 1,500 epochs, to train the neural networks (the loss converged after 1,000 epochs in all our models). We repeated this process for 50 iterations and compared the performance of the different models. For completeness, both linear and nonlinear activation functions were explored for model fitting (linear, tanh, sigmoid, rectified linear unit, L1-norm regularization; Fig. 4b-c); a simple linear activation function yielded the best results.
Existing frameworks13,37 suggest that all face-impression dimensions are of the same order (i.e., no dimension is a higher- or lower-order dimension of the others), but that the number of dimensions varies. Therefore, we first constructed different autoencoder models with only one hidden layer that varied in the number of neurons in this hidden layer, corresponding to the number of underlying dimensions (from 1 to 10). The input layer and output layer were the same for all models, where each face was represented by a vector of ratings across the 92 traits and each trait corresponded to a neuron. All layers were densely connected. We trained these different models and compared their performance (assessed with the explained variance on the held-out test data).
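A minimal sketch of one such model in Keras, assuming `ratings_train` and `ratings_test` are the aggregated (faces × traits) rating matrices from the two data halves, with a hypothesized four-dimensional bottleneck and the training settings described above:

```python
# Single-hidden-layer linear autoencoder: bottleneck = number of latent dimensions.
import tensorflow as tf
from sklearn.metrics import explained_variance_score

n_traits = ratings_train.shape[1]   # one input/output neuron per trait
n_dims = 4                          # hypothesized number of underlying dimensions

model = tf.keras.Sequential([
    tf.keras.layers.Dense(n_dims, activation="linear",
                          input_shape=(n_traits,)),       # encoder / bottleneck
    tf.keras.layers.Dense(n_traits, activation="linear"), # decoder / reconstruction
])
model.compile(optimizer="adam", loss="mse")
model.fit(ratings_train, ratings_train, batch_size=32, epochs=1500, verbose=0)

# Performance: explained variance of reconstructions on the held-out half.
recon = model.predict(ratings_test, verbose=0)
print(explained_variance_score(ratings_test, recon))
```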
In addition, we tested for a potential hierarchical factor structure in our data by adding one hidden encoder layer with various numbers of neurons (from 1 to 10) before the middle hidden layer (also with various numbers of neurons from 1 to 10); since our autoencoders were constructed symmetrically, these hierarchical latent structures were mirrored in the decoder layers (i.e., three hidden layers in total). Results showed that adding hidden layers did not improve model performance.
Study 2 Participants
The study was approved by the Institutional Review Board of the California Institute of Technology, and informed consent was obtained from all participants. We preregistered to recruit participants through Digital Divide Data, a social enterprise that delivers research services, in seven countries/regions of the world: North America (U.S. and Canada), Latvia, Peru, the Philippines, India, Kenya, and Gaza. All participants were required to be between 18 and 40 years old, proficient in English (except participants in Peru, where all materials were translated into Spanish), educated at least through high school, trained in basic computer skills, and to have never visited or lived in Western-culture countries (except participants in North America and Latvia). In addition, we aimed for a roughly equal sex ratio of participants at all locations.
The sample size for each location was predetermined to be 30 participants, based on two criteria: first, the sample size should be large enough to ensure stable average trait ratings (for a corridor of stability of +/- 1.00 and a confidence level of 95%, the point of stability ranged from 5 to 11 participants across 24 traits59); second, the sample size should be feasible to accrue at all seven locations given the requirements mentioned above and the availability of participants to pay multiple visits to complete all experiment sessions over a 10-day period. See preregistration at http://bit.ly/osfpre2. As planned, 30 individuals (15 females, 15 males) participated at each of the seven locations (age M = 26, SD = 4 for North America; M = 28, SD = 5 for Latvia; M = 22, SD = 3 for Peru; M = 25, SD = 4 for the Philippines; M = 27, SD = 6 for India; M = 24, SD = 2 for Kenya; and M = 26, SD = 5 for Gaza).
Study 2 Procedures
All experiments were completed onsite in the Digital Divide Data local offices. Participants in North America, Latvia, the Philippines, India, Kenya, and Gaza completed all experiments in English. Participants in Peru completed all experiments in Spanish. An exact translation of the experiment instructions, trait words, and definitions of the traits from English to Spanish was provided by the Peru office of Digital Divide Data. Both the English and Spanish versions of those materials can be accessed at our preregistration (https://osf.io/qxgmw).
Eighty of the 100 trait words were used in Study 2. Twenty words were excluded because of low correlations with other traits in Study 1 (sarcastic, white, thrifty, shallow, homosexual, nosey, conservative, and reserved), ambiguity or similarity in meaning according to feedback from Study 1 (trustful, natural, passive, reasonable, strict, enthusiastic, affectionate, and sincere), or potential offensiveness in some cultures (idiot, loser, criminal, and abusive).
Participants in all seven countries/regions followed the same experimental procedures. Each participant provided ratings of all faces on all traits, of which 20 traits were rated twice to assess within-subject consistency (see our preregistration). The 80 traits were divided into 20 modules, each consisting of 4 distinct traits (the 20 retested traits were first assigned to distinct modules, and the remaining traits were then randomly assigned across modules with the constraint that traits in the same module should be balanced in valence). All participants completed all 20 modules during multiple visits to the local offices within ten business days. Each module consisted of 5 blocks, with the retested trait always shown in the first and last blocks and the other traits shown in random order. The experimental procedure within each module was identical to that of Study 1.
Measures of reliability in Study 2
Data were first processed following our preregistered exclusion criteria A to C (see preregistration at https://osf.io/tbmsy): of the full sample with a preregistered size of N = 30 participants and L = 300,000 ratings at each of the 7 locations (N = 210 in total), we excluded from further analysis n = 1 participant in India, as well as l = 24,236 ratings in North America, l = 2,507 ratings in Latvia, l = 16,366 ratings in Peru, l = 3,178 ratings in the Philippines, l = 14,389 ratings in India, l = 9,117 ratings in Kenya, and l = 4,096 ratings in Gaza. Exclusion criterion D was not applied to the analyses of within-subject consistency and between-subject consensus because it imposed a strict lower bound on within-subject consistency to ensure data quality, which might lead to an overestimation of the reliability of the data.
All participants at all locations rated a subset of twenty traits twice for all faces. Analyses of within-subject consistency identical to those in Study 1 were performed for each of the seven datasets (l = 100 pairs of ratings across faces per participant for ca. n = 28 participants per location). We found acceptable within-subject consistency at all locations (rs > 0.20, except for the ratings of competent, religious, anxious, and critical in India [rs = 0.18, 0.18, 0.19, 0.19] and the ratings of anxious in Peru [r = 0.19]). As hypothesized in our preregistration, across all locations, ratings of traits regarding physical appearance had higher within-subject consistency (e.g., feminine, youthful, healthy, with mean rs = 0.74, 0.57, 0.51, respectively) than traits that were more abstract (e.g., critical, anxious, religious, with mean rs = 0.31, 0.32, 0.33, respectively), corroborating findings from Study 1 (Figs. 3–4).
Assessment of between-subject consensus at each location used data from all participants within the same location (l = 100 ratings per participant for the 100 faces from ca. n = 28 participants per trait per location). As hypothesized in our preregistration, traits regarding physical appearance, such as feminine, youthful, beautiful, and baby-faced, showed high between-subject consensus at all seven locations (all ICCs > 0.86). At the other extreme, some locations had trait ratings with near-zero consensus within that location (the ratings of compulsive in Gaza, prudish in India and Kenya, and self-critical in Gaza and the Philippines). This stood in contrast to the findings from Study 1, where ICCs > 0.61 for all 100 traits (Fig. 3), and to the samples from North America (ICCs > 0.61 for all traits) and Latvia (ICCs > 0.50 for all traits).
Data processing for RSA and dimensionality analysis in Study 2
To ensure high-quality and complete data from individuals, we registered four exclusion criteria (A-D) while data collection was underway and the data had not yet been analyzed (see registration at https://osf.io/tbmsy), in addition to those planned in our original preregistration (https://osf.io/qxgmw). Analyses of representational similarity and dimensionality for both aggregated and individual data were performed on data processed with exclusion criteria A-D. Following those criteria, thirty-one participants across the seven locations were excluded from further analysis (n = 3 for North America, n = 2 for Latvia, n = 7 for Peru, n = 3 for the Philippines, n = 10 for India, n = 2 for Kenya, and n = 4 for Gaza). Of the remaining participants, n = 86 had complete data for all 80 traits; data from these 86 participants were used in the individual-level analyses (Fig. 7).
Data and code availability
All data, code, and materials are available at the Open Science Framework: https://osf.io/4mvyt/ and https://osf.io/xeb6w/.