To enable the reuse of terminology data for ontology and knowledge graph construction, we followed the top-down strategy for building domain-specific ontologies [30, 31] and drew on the authoritative terminology systems ICD and SNOMED CT to establish terminology criteria [32, 33]. The workflow comprised six steps: (1) classification schema design, (2) concept representation model building, (3) term source selection and term extraction, (4) hierarchical structure construction, (5) quality control, and (6) web service. All editing was performed on TBench, a work platform for editing and maintaining cross-lingual terminology systems.
Classification schema design
The classification schema design defined the scope of the terminology as a whole, focusing on the important branches that needed to be included. The schema had to take multiple dimensions of COVID-19 research into account, e.g. the research directions of epidemiology and of a specific disease, and the design of the data structure. Specifically, in terms of epidemiology, 8 categories were proposed: person status, affected group, disease distribution, disease spreading, incidence, occurrence, etiology, and disease understanding. Other information categories (e.g. diagnosis method and treatment technique) were also necessary, and because the emphasis of terminology construction differs considerably from that of epidemiology, additional requirements had to be considered, especially for an indexing-, annotation-, and retrieval-oriented terminology system. We also referred to the structure of other terminology systems, e.g. SNOMED CT, which incorporates body structure, clinical finding, environment or geographical location, event, observable entity, organism, specimen, substance, etc. Since this category system was designed mainly for clinical use and the general medical field, it might not be appropriate for a specific disease such as COVID-19. To obtain a more comprehensive perspective for clinical and data researchers, we consulted experts in medical informatics and reached agreement. We then developed 10 first-level top nodes for the classification schema: disease; anatomic site; clinical manifestation; demographic and socioeconomic characteristics; living organism; qualifiers; psychological assistance; medical equipment, instruments and materials; epidemic prevention and control; and diagnosis and treatment technique.
Concept representation model building
Referring to the Simple Knowledge Organization System (SKOS), we designed a concept representation model in which all terms are organized around their core concept. As shown in the concept representation model (Figure 1), each core concept was assigned one concept ID and three elements, i.e. definition, term, and semantic type. The semantic type represents the category closest in meaning to a concept, which makes it an efficient way to retrieve concepts and terms of a given semantic type. Each term carries the attributes term ID, lexical value (term content), term source, preferred term flag (whether the term is the preferred term or a synonym), and language. A semantic type carries a semantic type ID, a lexical value, and language information. A definition carries a lexical value, a definition source, and language information. In terms of relationships, every concept except a leaf concept has one or more sub-concepts, and every concept outside the top category is the sub-concept of another concept. Each concept has its term, definition and semantic type. Each term is the term of one concept. Each definition is the definition of one concept. Each semantic type is the semantic type of one concept.
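The concept representation model described above can be sketched as a small set of Python data classes. This is a minimal illustration under stated assumptions, not the TBench schema: all class names, field names, identifiers, and source strings are hypothetical, a single parent per concept is assumed, and semantic types are stored as a list so that one label per language can be attached.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    term_id: str
    lexical_value: str   # the term content
    source: str          # where the term was extracted from
    is_preferred: bool   # True for the preferred term, False for a synonym
    language: str        # assumed codes: "en" or "zh"


@dataclass
class Definition:
    lexical_value: str
    source: str
    language: str


@dataclass
class SemanticType:
    semantic_type_id: str
    lexical_value: str
    language: str


@dataclass
class Concept:
    concept_id: str
    semantic_types: List[SemanticType] = field(default_factory=list)
    terms: List[Term] = field(default_factory=list)
    definitions: List[Definition] = field(default_factory=list)
    parent_id: Optional[str] = None  # None for a concept in the top category

    def preferred_term(self, language: str) -> Optional[Term]:
        """Return the preferred term in the given language, if any."""
        for term in self.terms:
            if term.is_preferred and term.language == language:
                return term
        return None

    def synonyms(self, language: str) -> List[Term]:
        """Return all non-preferred terms in the given language."""
        return [t for t in self.terms
                if not t.is_preferred and t.language == language]


# Hypothetical example concept; identifiers and sources are placeholders.
fever = Concept(
    concept_id="C0001",
    semantic_types=[SemanticType("S01", "clinical manifestation", "en")],
    terms=[
        Term("T0001", "fever", "hypothetical source", True, "en"),
        Term("T0002", "pyrexia", "hypothetical source", False, "en"),
        Term("T0003", "发热", "hypothetical source", True, "zh"),
    ],
)
```

Keeping terms, definitions, and semantic types as separate objects attached to one concept mirrors the concept-centred organization of the model: a bilingual lookup reduces to filtering a concept's terms by language and preferred flag.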
Term source selection and term extraction
A bilingual terminology system for a worldwide disease emergency should be correct, authoritative and highly relevant to the theme, and should present exact bilingual concepts, semantic types, etc. Therefore, the information resources we used were limited to authoritative publications: situation reports and documents from the World Health Organization, journal articles (including preprints and open-access articles), nationwide regulations, policy documents, professional textbooks, etc. Bilingual terms were mostly extracted from bilingual WHO documents, textbooks, and related papers. In most cases, definitions were taken from textbooks and related papers.
Hierarchical structure construction
We combined top-down and bottom-up approaches to formulate the final hierarchical structure. On the one hand, related clinical classification architectures were identified in associated documentation, literature, and textbooks (e.g. textbooks in epidemiology, virology, and preclinical medicine) and reused. On the other hand, terms extracted from the literature and other resources were synthesized from the bottom up. An agile model was adopted during the procedure, i.e. the structure was adjusted by adding, altering or deleting specific substructures when necessary. The finalized hierarchical structure (Figure 2) was reviewed and assessed by one expert in clinical medicine and two professionals in medical informatics.
Relationship and property development
Relationships and properties of each concept were developed in this step. Each concept was assigned concept ID, term ID, bilingual semantic type, Chinese preferred term, and English preferred term as obligatory properties, and Chinese synonym, English synonym, bilingual definition, and definition source as optional properties. We combined terms with the same meaning into one concept with different synonyms, and grouped concepts within the same classification into one subset. The editing date and time were generated automatically by the system. Among these properties, concept ID and term ID could be linked directly to other systems through automatic mapping, and each definition was required to have a source for users to consult. Synonyms were not mandatory, but more synonyms broaden the search scope and make terms easier to locate.
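The property rules above (a fixed set of obligatory items, and a source required for every definition) lend themselves to a simple record check. The sketch below assumes each concept is exported as a flat dictionary; the field names are hypothetical and are not taken from TBench.

```python
# Hypothetical field names for the obligatory properties of a concept record.
OBLIGATORY_FIELDS = (
    "concept_id",
    "term_id",
    "semantic_type_zh",
    "semantic_type_en",
    "preferred_term_zh",
    "preferred_term_en",
)


def is_blank(value) -> bool:
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is None or not str(value).strip()


def missing_obligatory(record: dict) -> list:
    """Return the obligatory fields that are absent or blank in a record."""
    return [name for name in OBLIGATORY_FIELDS if is_blank(record.get(name))]


def definition_lacks_source(record: dict) -> bool:
    """True if any definition is given without a definition source."""
    has_definition = not (is_blank(record.get("definition_zh"))
                          and is_blank(record.get("definition_en")))
    return has_definition and is_blank(record.get("definition_source"))


# Hypothetical record with a missing English preferred term and an
# unsourced definition.
record = {
    "concept_id": "C0001",
    "term_id": "T0001",
    "semantic_type_zh": "临床表现",
    "semantic_type_en": "clinical manifestation",
    "preferred_term_zh": "发热",
    "preferred_term_en": "",
    "definition_en": "An abnormally high body temperature.",
}
problems = missing_obligatory(record)
unsourced = definition_lacks_source(record)
```

Synonyms are deliberately left out of the check because they are optional in the scheme described above.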
Quality control
To guarantee the correctness of the terminology, we performed quality control after each round of editing and before each version update. After each editing round, two examiners with professional backgrounds and related practical experience were invited to validate the accuracy of the terminology. A third party of clinical experts was involved when the examiners disagreed. Before each release, we performed quality control via cross assessment, i.e. automatic checking and expert review. Automatic checking covered repeated term detection (the same term under different concept identifiers), language detection (English terms marked as Chinese or vice versa), abnormal character detection (terms that cannot be read by machine), closed hierarchical relationship detection (circularity in a hierarchical tree), hierarchical depth detection (whether a term is too deep for users to browse), and spelling detection (questionable spelling). The expert review covered classification checking (whether the classification is appropriate from a professional perspective and whether a term is suitable under a specific category) and content checking (whether the definitions or synonyms of a term are correct and relevant). Once quality control gave positive feedback, we updated and released the terminology online.
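Several of the automatic checks listed above can be illustrated with a short Python sketch. It is not the checker used for COVID Term: the function names, input shapes, and language codes are assumptions, the language check is a crude heuristic based on the presence of CJK characters (so a Chinese entry written entirely in Latin characters would be flagged for manual review), and spelling and abnormal character detection are omitted.

```python
import re
from collections import defaultdict

# Basic CJK Unified Ideographs range, used as a rough signal for Chinese text.
CJK = re.compile(r"[\u4e00-\u9fff]")


def repeated_terms(rows):
    """rows: iterable of (concept_id, lexical_value).
    Return terms that appear under more than one concept ID."""
    concepts_by_term = defaultdict(set)
    for concept_id, value in rows:
        concepts_by_term[value.strip().lower()].add(concept_id)
    return {t: ids for t, ids in concepts_by_term.items() if len(ids) > 1}


def language_mismatches(rows):
    """rows: iterable of (term_id, lexical_value, language).
    Flag 'en' terms containing CJK characters and 'zh' terms without any."""
    flagged = []
    for term_id, value, language in rows:
        has_cjk = bool(CJK.search(value))
        if (language == "en" and has_cjk) or (language == "zh" and not has_cjk):
            flagged.append(term_id)
    return flagged


def concepts_in_cycles(parent_of):
    """parent_of: dict mapping concept ID to parent ID (None for top nodes).
    Return the concept IDs that lie on a closed hierarchical relationship."""
    in_cycle = set()
    for start in parent_of:
        path, seen, node = [], set(), start
        while node is not None and node not in seen:
            seen.add(node)
            path.append(node)
            node = parent_of.get(node)
        if node is not None:
            # We walked back onto a visited node: everything from its first
            # occurrence in the path onward forms the cycle.
            in_cycle.update(path[path.index(node):])
    return in_cycle


def depth_of(concept_id, parent_of):
    """Number of levels from the concept up to its top node (top node = 1).
    The walk stops if it revisits a node, so cycles cannot loop forever."""
    depth, seen, node = 1, {concept_id}, parent_of.get(concept_id)
    while node is not None and node not in seen:
        seen.add(node)
        depth += 1
        node = parent_of.get(node)
    return depth


def too_deep(parent_of, max_depth):
    """Return concept IDs deeper than max_depth (a hypothetical threshold)."""
    return [c for c in parent_of if depth_of(c, parent_of) > max_depth]
```

For example, with `parent_of = {"C1": None, "C2": "C1", "C4": "C5", "C5": "C4"}`, `concepts_in_cycles` flags `C4` and `C5`, which would then be passed to the expert review rather than corrected automatically.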
Web service
The terminology for COVID-19 was named COVID Term. We built a website for COVID Term so that users can access each updated data version. Each update round required enriched sub-branches with abundant information, so the COVID team constantly followed up-to-date COVID resources, e.g. The Lancet coronavirus theme, the NIH 2019 novel coronavirus theme, the WHO COVID-19 theme, and the New England Journal of Medicine COVID-19 theme [37-40], to provide the most recent terminology. Earlier versions were also released on the Population Health Data Archive (PHDA).