Selection of publications and their authors
We used data from SCImago Journal & Country Rank to identify all countries whose researchers authored at least 1,000 scientific publications in the field of medicine in 2020. SCImago Journal & Country Rank is a publicly available portal that provides scientific indicators for journals and countries, developed from information in the Scopus® database.12 Citation data are drawn from over 34,000 titles and over 5,000 international publishers. Seventy-five countries met the inclusion criterion for the study, as shown in Table 1 (country #1: USA with 277,130 publications, country #75: Cuba with 1,059 publications).
We also used data from International Migrant Stock 2020, available on the United Nations Population Division portal, to obtain the percentage of migrants by country in 2020. Data on estimates of the number (or "stock") of international migrants are presented as a percentage of the total population, by age, sex, and country of destination, and are based on national statistics, in most cases obtained from population censuses.13 We selected the 22 countries for which this proportion was below 2.5 percent (Table 1). We restricted the study to these countries only in order to obtain names of researchers that were as homogeneous as possible and representative of the selected countries. The proportion of migrants for these countries ranged from zero for Cuba to 2.2 percent for Japan and Poland.
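The two inclusion criteria described above (at least 1,000 medical publications in 2020 and a migrant share below 2.5 percent) can be illustrated with a minimal Python sketch. The record layout, the function name `select_countries` and the sample values are hypothetical placeholders, not the actual SCImago or United Nations figures.

```python
from typing import List, NamedTuple


class CountryRecord(NamedTuple):
    name: str
    publications_2020: int     # medical publications in 2020
    migrant_share_pct: float   # international migrants as % of total population


MIN_PUBLICATIONS = 1000        # inclusion criterion from the text
MAX_MIGRANT_SHARE_PCT = 2.5    # strict upper bound from the text


def select_countries(records: List[CountryRecord]) -> List[str]:
    """Return the names of the countries that meet both inclusion criteria."""
    return [
        r.name
        for r in records
        if r.publications_2020 >= MIN_PUBLICATIONS
        and r.migrant_share_pct < MAX_MIGRANT_SHARE_PCT
    ]


# Hypothetical placeholder records, for illustration only.
records = [
    CountryRecord("Country A", 5000, 1.0),
    CountryRecord("Country B", 800, 0.5),     # too few publications
    CountryRecord("Country C", 20000, 12.0),  # migrant share too high
    CountryRecord("Country D", 1000, 2.4),
]
selected = select_countries(records)
```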
Then, using PyMed,14 a Python library that gives access to PubMed, we extracted all publications in 2021 with at least one author affiliated with a university or research institute located in the selected countries (N=120,104). We obtained a CSV file in which the variable 'authors' had the following form (example for a publication with three authors):
[{'lastname': 'x', 'firstname': 'x', 'initials': 'x', 'affiliation': 'x'}, {'lastname': 'y', 'firstname': 'y', 'initials': 'y', 'affiliation': 'y'}, {'lastname': 'z', 'firstname': 'z', 'initials': 'z', 'affiliation': 'z'}]
Using Stata, we created the variable 'author1' (i.e., data for first authors only) and the variable 'country1' (i.e., country of affiliation of first authors). We excluded publications whose first author was not affiliated with an institution in one of the selected countries. The resulting study database contained data for 89,906 publications.
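This step was carried out in Stata; the following Python sketch only illustrates the logic. It assumes that the 'authors' field is stored as a string containing a list of dictionaries written with straight quotes, and that the country can be recognized by searching the affiliation text for a country name. The matching rule, the function names and the sample affiliations are hypothetical and are not taken from the study.

```python
import ast
from typing import Iterable, Optional


def first_author(authors_field: str) -> Optional[dict]:
    """Parse the stringified list of author dicts and return the first entry."""
    authors = ast.literal_eval(authors_field)
    return authors[0] if authors else None


def affiliation_country(affiliation: Optional[str],
                        countries: Iterable[str]) -> Optional[str]:
    """Return the first selected country named in the affiliation, else None.

    Assumption: a simple case-insensitive substring match is sufficient.
    """
    if not affiliation:
        return None
    text = affiliation.lower()
    for country in countries:
        if country.lower() in text:
            return country
    return None


# Three of the selected countries mentioned in the text; the row is made up.
SELECTED = ["Japan", "Poland", "Cuba"]
row = ("[{'lastname': 'x', 'firstname': 'x', 'initials': 'x', "
       "'affiliation': 'Department X, University X, Japan'}, "
       "{'lastname': 'y', 'firstname': 'y', 'initials': 'y', "
       "'affiliation': 'Institute Y, France'}]")

author1 = first_author(row)
country1 = affiliation_country(author1["affiliation"], SELECTED)
keep = country1 is not None  # drop the publication when this is False
```

Real affiliation strings are far less regular than this example, so a production version would need a more robust country parser.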
NamSor Applied Onomastics
The authors' names were classified with NamSor Applied Onomastics, a name recognition software package.15 The software recognizes the linguistic or cultural origin of each name and assigns a gender (male or female) and/or an onomastic class (e.g., China, India). As the estimation is probabilistic, the software also provides a probability for the inference ('probabilityCalibrated') ranging from zero to one.
The names can be classified according to the continent of origin (three continents: Asia, Africa or Europe), the country of origin (e.g., China or India) and the ethnicity (e.g., Chinese or Indian). We created two other variables: continent#2 ("Europe" replaced by "Europe, America or Oceania") and country#2 ("Spain" replaced by “Spain or Hispanic American country” and "Portugal" replaced by "Portugal or Brazil"). We added these variables because a preliminary analysis of our data showed that a majority of researchers with Hispanic or Portuguese names who were affiliated with universities or research institutes in Brazil, Mexico or Cuba were considered to be from either Spain or Portugal.
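The two derived variables can be expressed as simple relabelings of the original classes. The sketch below uses the labels quoted in the text; the dictionary and function names are hypothetical, and the input labels are assumed to be plain strings such as "Spain" or "Europe".

```python
# Relabeling used for country#2 (labels taken from the text).
COUNTRY2 = {
    "Spain": "Spain or Hispanic American country",
    "Portugal": "Portugal or Brazil",
}

# Relabeling used for continent#2 (labels taken from the text).
CONTINENT2 = {
    "Europe": "Europe, America or Oceania",
}


def broaden(label: str, mapping: dict) -> str:
    """Return the broadened label, or the original label if it is not remapped."""
    return mapping.get(label, label)


country2 = broaden("Spain", COUNTRY2)
continent2 = broaden("Europe", CONTINENT2)
unchanged = broaden("China", COUNTRY2)
```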
Performance analysis
We evaluated NamSor’s performance by computing three efficiency metrics.11,16 These metrics refer to the confusion matrix that contains three components, with 'c' corresponding to correct classifications, 'i' to misclassifications (i.e., a wrong continent, country or ethnicity assigned to a name) and 'u' to non-classifications (i.e., no continent, country or ethnicity assigned).
| | Correct continent, country or ethnicity (predicted) | Incorrect continent, country or ethnicity (predicted) | Unknown (predicted) |
| Continent, country or ethnicity (actual) | c | i | u |
errorCoded = ( i + u ) / ( c + i + u )
errorCodedWithoutNA = ( i ) / ( c + i )
naCoded = ( u ) / ( c + i + u )
These performance metrics can be interpreted as follows: errorCoded estimates the proportion of misclassifications and non-classifications combined (this measure therefore penalizes both types of errors equally); errorCodedWithoutNA measures the proportion of misclassifications among classified names only, i.e., excluding non-classifications; and naCoded measures the proportion of non-classifications.
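As a worked illustration, the three metrics can be computed directly from the counts c, i and u. The function name and the example counts below are hypothetical; returning NaN when a denominator is zero is a design choice of this sketch, not something specified in the text.

```python
import math


def performance_metrics(c: int, i: int, u: int) -> dict:
    """Compute errorCoded, errorCodedWithoutNA and naCoded from the counts."""
    total = c + i + u
    classified = c + i
    nan = float("nan")
    return {
        "errorCoded": (i + u) / total if total else nan,
        "errorCodedWithoutNA": i / classified if classified else nan,
        "naCoded": u / total if total else nan,
    }


# Hypothetical counts: 80 correct, 15 misclassified, 5 unclassified names.
metrics = performance_metrics(c=80, i=15, u=5)
```

With these counts, errorCoded is 20/100, errorCodedWithoutNA is 15/95 and naCoded is 5/100.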
We repeated the analyses after removing all results with an inference accuracy below 40%, 50%, 60% and 70%, respectively. All assignments made with an accuracy level below the selected threshold value were treated as non-classifications. We performed all analyses with Stata version 15.1 (College Station, TX, USA).
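The threshold analysis can be sketched as a recount of c, i and u in which any assignment below the threshold is moved to the "unknown" category. The sketch assumes that "inference accuracy" refers to the probability returned by the software, expressed on a zero-to-one scale; the record layout, the function name and the sample records are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Thresholds from the text, expressed as proportions.
THRESHOLDS = (0.4, 0.5, 0.6, 0.7)

Record = Tuple[str, Optional[str], float]  # (actual, predicted, probability)


def confusion_counts(records: Iterable[Record],
                     threshold: float = 0.0) -> Tuple[int, int, int]:
    """Count correct (c), incorrect (i) and unknown (u) classifications.

    Predictions that are missing or whose probability is below the
    threshold are counted as non-classifications.
    """
    c = i = u = 0
    for actual, predicted, probability in records:
        if predicted is None or probability < threshold:
            u += 1
        elif predicted == actual:
            c += 1
        else:
            i += 1
    return c, i, u


# Hypothetical records, for illustration only.
records = [
    ("China", "China", 0.95),
    ("China", "Japan", 0.45),
    ("India", "India", 0.55),
    ("Japan", "Japan", 0.65),
    ("Poland", None, 0.0),
]
counts_by_threshold = {t: confusion_counts(records, t) for t in THRESHOLDS}
```

Raising the threshold can only move names from c or i into u, so naCoded never decreases as the threshold increases.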
Ethical considerations
Since this study did not involve the collection of personal health-related data, it did not require ethical review under current Swiss law.