Symptom cluster patterns of COVID-19 based on a text clustering method and their population characteristics in Sichuan Province, China

Background: Evidence on the aggregation of COVID-19 symptoms is still limited. Exploring likely cluster patterns of symptoms may be helpful. Methods: This study enrolled a total of 1067 COVID-19 cases. Symptom cluster patterns were explored with a text clustering method. A multinomial logistic regression was applied to reveal their population characteristics. Time intervals between symptom onset and the first visit were analyzed to take into account the progression of symptoms over time. Results: Based on text clustering, the symptoms were summarized into 4 groups. Apart from the group without obvious symptoms, the dominant individual symptoms in the other three groups were fever (68.7%), expectoration (59.4%) and fatigue (42.7%), respectively, and the most frequent symptom combinations were fever only (47.8%), expectoration only (19.8%) and fatigue accompanied by fever (4.2%), respectively. People aged 45-64 years were more likely to have symptom group 4 than those aged 65 years or older (OR = 2.66, P = 0.015) and had longer time intervals. Conclusions: The symptom cluster patterns and the common symptom combinations under each pattern may provide information for the identification of infected persons. The middle-aged population was a group deserving more attention from the perspective of delays in seeking medical care.


Background
Coronavirus disease 2019 (COVID-19) has caused significant morbidity and mortality worldwide [1]. Although the epidemic has been brought under control to a certain degree, new outbreaks caused by sporadic cases remain possible [2]. Warning signs from early symptoms may be helpful.
Clinical symptoms, as indicators for identification and diagnosis, play a vital role in early detection and treatment. COVID-19 has a wide range of clinical manifestations, from asymptomatic infection to severe viral pneumonia [3,4]. It has been widely confirmed that fever, dry cough, expectoration and fatigue are the most common symptoms of COVID-19 patients [3,5-7]. In addition, symptoms of the cardiovascular system [8], the digestive system [9] and the skin [10], as well as loss of taste and smell [11], have also been reported. A number of studies have examined the dynamics of the clinical course of symptoms [5,12,13]. Other studies have revealed similarities and differences in symptoms between COVID-19 and other respiratory infections such as influenza.
Nevertheless, most published studies have focused primarily on descriptions of individual symptoms. Given that two or more symptoms normally coexist in one infected person and that symptom combinations are usually similar between individuals, the purpose of this study was to explore, with a text clustering method, whether symptoms in COVID-19 patients aggregate into likely cluster patterns. Given that the symptoms of COVID-19 both overlap with and differ from those of other infectious diseases such as influenza [14], the likely cluster patterns and the common symptom combinations under each pattern may help to improve early identification of COVID-19. Building on the symptom cluster patterns, this study also analyzed the population characteristics of the different symptom groups. Furthermore, to take into account the progression of symptoms over time, time intervals between symptom onset and the first visit were analyzed.

Study design and data source
In this cohort study, a total of 1067 laboratory-confirmed cases of COVID-19 reported from January 21, 2020 to November 20, 2020 in Sichuan Province were included. All data were obtained from the Epidemic Registration System of the Sichuan Center for Disease Control and Prevention (CDC). This study was approved by the Ethics Committee of the Sichuan Provincial Center for Disease Control and Prevention (No. SCCDCIRB 2020-007). Written informed consent was obtained from each subject.

Statistical analysis
Symptoms were first explored with a text clustering method based on Euclidean distance. Then, drawing on pathophysiology [15,16] and consultation with clinical experts from the Sichuan Provincial Center for Disease Control and Prevention, the symptoms were summarized into different groups. Bar charts were used to visualize the symptom combinations under each group. In addition, a word cloud was used to display the dominant symptoms of each group according to their frequencies. Categorical variables were described by frequency and percentage, and continuous variables by median and interquartile range (IQR).
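The clustering step can be illustrated with a minimal sketch. It assumes that each case's symptom text has already been split into terms, encodes each case as a binary bag-of-words vector, and groups cases by Euclidean distance. The choice of average-linkage agglomerative clustering, the function names and the toy records are illustrative assumptions; the article does not specify the algorithm beyond the distance measure.

```python
import math
from itertools import combinations

def vectorize(records, vocabulary):
    """Encode each record (a set of symptom terms) as a binary vector."""
    return [[1 if term in rec else 0 for term in vocabulary] for rec in records]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering (an assumed algorithm)."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters; ties go to the first pair found.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical toy records, not data from the study.
records = [
    {"fever"}, {"fever", "dry cough"},
    {"expectoration"}, {"expectoration", "runny nose"},
    {"fatigue", "headache"}, {"fatigue", "headache", "fever"},
]
vocabulary = sorted(set().union(*records))
vectors = vectorize(records, vocabulary)
clusters = agglomerate(vectors, n_clusters=3)
```

In this toy example the records that share a dominant symptom end up in the same cluster. In practice the number of clusters would still need to be reviewed against clinical knowledge, as described above.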
Based on this classification, a multinomial logistic regression was applied to identify potential factors associated with the symptom groups, with symptom group 1 as the reference category. Additionally, the interval between symptom onset and the first visit was examined through the change in the proportion of each group over time. Figure 1 shows the procedure of our analysis. The text clustering was conducted with Python version 3.7.6 and the remaining statistical analyses with R version 4.0.3. A P value less than 0.05 was considered statistically significant.
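For reference, a multinomial logistic regression with group 1 as the reference category can be written in the standard baseline-category form below. The notation is an assumption introduced here for illustration: Y_i is the symptom group of case i, x_i is the covariate vector (for example age group, sex, comorbidities and epidemiological characteristics), and beta_kj is the coefficient of covariate j for group k.

```latex
\log\frac{P(Y_i = k \mid \mathbf{x}_i)}{P(Y_i = 1 \mid \mathbf{x}_i)}
  = \beta_{k0} + \boldsymbol{\beta}_k^{\top}\mathbf{x}_i,
  \qquad k = 2, 3, 4,
\qquad
\mathrm{OR}_{kj} = \exp(\beta_{kj}).
```

Under this form, each odds ratio compares the odds of belonging to group k rather than group 1 between a covariate level and its reference level, with the other covariates held fixed.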

Symptom cluster patterns
As of November 20, 2020, information on 1067 cases had been collected. Based on the results of text clustering, symptoms were summarized into the following 4 groups:
Group 1: No obvious symptoms. Cases with no obvious symptoms but a positive nucleic acid test.
Group 2: Mainly fever and/or dry cough. Cases with fever as the main symptom, with or without dry cough.
Group 3: Mainly upper respiratory tract infection symptoms. Cases mainly with expectoration and upper respiratory tract infection symptoms such as pharyngodynia, stuffy nose and runny nose, with or without fever.
Group 4: Mainly cardiopulmonary, systemic and/or gastrointestinal symptoms. Cases whose main symptoms were cardiopulmonary symptoms such as shortness of breath, dyspnea, chest tightness and chest pain, and/or systemic symptoms such as fatigue, chills and muscle aches, and/or gastrointestinal symptoms such as nausea, vomiting and diarrhea, sometimes accompanied by fever and upper respiratory tract symptoms.
More than half (50.7%) of the infected persons showed no obvious symptoms at the first visit and thus fell into group 1. The three groups with obvious symptoms (groups 2, 3 and 4) accounted for 12.6%, 10.0% and 26.8% of cases, respectively. Among them, group 4 (cardiopulmonary, systemic and/or gastrointestinal symptoms) had the highest proportion.
To profile the symptom composition of each group, bar charts were used to visualize the particular symptoms under each group (Fig. 2). Symptoms within the same group overlapped and interacted. In symptom group 1, all cases had no obvious symptoms (541 cases, 100%). In symptom group 2, the most frequent symptom combination was fever only (64 cases, 47.8%), followed by dry cough only (42 cases, 31.3%). In symptom group 3, the most frequent symptom combination was expectoration only (21 cases, 19.8%), followed by fever with expectoration (10 cases, 9.4%). In symptom group 4, the most frequent symptom combination was fatigue with fever (12 cases, 4.2%), and headache with fever was also common (11 cases, 3.8%). Fig. 3 shows a word cloud based on the frequency of individual symptoms under each group; the larger the font, the higher the frequency. Fever and dry cough were the two most frequent symptoms overall, with frequencies of 64.4% and 38.8%, respectively, followed by expectoration (12.0%) and fatigue (11.4%). Fever (68.7%) and dry cough (52.24%) were the dominant symptoms in group 2, expectoration (59.4%) was the dominant symptom in group 3, and fatigue (42.7%) and headache (26.2%) were the dominant symptoms in group 4. Within each symptom group, symptoms showed some clustering around the dominant symptoms.
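The tabulation of symptom combinations within a group can be sketched as follows. This is a minimal illustration that assumes each case is available as a list of symptom terms; the function name and the toy data are hypothetical and do not reproduce the study data.

```python
from collections import Counter

def combination_frequencies(cases):
    """Count each distinct symptom combination and its percentage of the group."""
    # frozenset makes the combination independent of the order of symptoms.
    counts = Counter(frozenset(symptoms) for symptoms in cases)
    total = len(cases)
    return [(sorted(combo), n, round(100 * n / total, 1))
            for combo, n in counts.most_common()]

# Hypothetical toy group, not data from the study.
toy_group = [
    ["fever"], ["fever"], ["fever"],
    ["dry cough"], ["dry cough"],
    ["dry cough", "fever"],
]
freqs = combination_frequencies(toy_group)
```

Each entry of `freqs` holds the combination, the number of cases and the percentage within the group, sorted from most to least frequent, which is the information the bar charts display.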
Regarding comorbidities, the prevalence of hypertension was 6.84%, while the prevalences of diabetes, lung disease and cardiovascular disease were 2.44%, 1.88% and 2.06%, respectively. In addition, 41.24% of the infected patients were imported cases and 26.43% were infected as part of a family cluster.

Factors associated with different symptom groups
The results of the multinomial logistic regression (Table 2) showed that age, comorbidities and epidemiological characteristics were all independent factors influencing the presence of symptom group 4, namely cardiopulmonary, systemic and/or gastrointestinal symptoms.
Compared with the 0-12 years age group, the odds of group 4 symptoms were higher in both the 13-44 years and 45-64 years age groups (OR = 4.08, P = 0.032; OR = 5.91, P = 0.007). In addition, to examine whether the odds of the three obvious symptom groups differed between the younger age groups and the ≥65 years group, the analysis was repeated with the ≥65 years group as the reference. Notably, people aged 45-64 years were more likely to develop group 4 symptoms than those aged ≥65 years (OR = 2.66, P = 0.015).
No significant differences in the odds of the three obvious symptom groups were detected between the sexes. Regarding comorbidities, the odds of symptom group 2 did not differ significantly between patients with and without diabetes (P = 0.111), but in those with diabetes the odds of group 3 and group 4 were significantly higher (OR = 29.43, P = 0.004; OR = 41.72, P = 0.001), indicating that diabetes was a strong risk factor for upper respiratory tract symptoms and for cardiopulmonary, systemic and/or gastrointestinal symptoms. In addition, there was no significant difference in the odds of any of the three obvious symptom groups between patients with and without hypertension, lung disease or cardiovascular disease.
Moreover, the odds of all three obvious symptom groups were lower in imported cases than in indigenous cases, and lower in cluster cases than in non-cluster cases (OR < 1, P < 0.05).

Time intervals between symptom onset and the first visit
Among all symptomatic cases, the median time interval between symptom onset and the first visit was 1 day, with an interquartile range of (0,3) days. Of the symptomatic patients, 47.5% visited a medical institution on the day of symptom onset, 15.4% one day after onset, 11.4% two days after onset, and 25.7% three or more days after onset. Fig. 4 displays the proportions of cases in the three groups with obvious symptoms as the time interval lengthened. The proportion of symptom group 2 decreased as the time interval lengthened, the proportion of symptom group 4 increased, and the proportion of symptom group 3 peaked at intermediate intervals.
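The interval analysis can be sketched as follows. The example assumes that each symptomatic case is available as a pair (days from onset to first visit, symptom group) and pools intervals of three or more days into one bin, mirroring the categories reported above. The function name, the `cap` parameter and the toy data are illustrative assumptions.

```python
from collections import Counter, defaultdict
from statistics import median

def group_share_by_interval(cases, cap=3):
    """Share of each symptom group within each onset-to-visit interval.

    Intervals of `cap` days or more are pooled into a single bin.
    """
    by_day = defaultdict(Counter)
    for days, group in cases:
        by_day[min(days, cap)][group] += 1
    return {
        day: {g: n / sum(counter.values()) for g, n in counter.items()}
        for day, counter in sorted(by_day.items())
    }

# Hypothetical toy cases as (interval in days, symptom group), not study data.
toy_cases = [
    (0, 2), (0, 2), (0, 3), (0, 4),
    (1, 2), (1, 3), (1, 4), (1, 4),
    (3, 4), (4, 3), (5, 4), (7, 4),
]
shares = group_share_by_interval(toy_cases)
median_days = median(days for days, _ in toy_cases)
```

In the toy data the share of group 2 falls and the share of group 4 rises as the interval lengthens, which is the kind of trend that Fig. 4 is described as showing.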
Because people aged 45-64 years were more likely to show more severe symptoms than people aged 65 years or older, we analyzed the time intervals between symptom onset and the first visit in different age groups to explore whether this finding was affected by the progression of symptoms. The median time interval was 1 day in the 0-12, 13-44 and 45-64 years groups, and 0 days in the ≥65 years group (Fig. 5). The ranges were wider in the 13-44 years and 45-64 years age groups, at (0,14) days and (0,15) days respectively, than in the 0-12 years and ≥65 years age groups, at (0,7) days and (0,8) days respectively. Patients aged 13-64 years thus had longer time intervals.

Discussion
This study focused on the aggregation of different symptoms and explored symptom cluster patterns. Like many prior studies, we found that fever and dry cough were the most common symptoms, followed by expectoration and fatigue [18,19]. Symptoms appeared to follow different patterns, which could be summarized into four groups, and we illustrated the specific symptom combinations under each group.
The most frequent individual symptoms in the three groups with obvious symptoms were fever, expectoration and fatigue, respectively, and the most frequent symptom combinations were fever only, expectoration only and fatigue accompanied by fever, respectively. It has been confirmed that fever, cough, expectoration and fatigue are among the main symptoms of both COVID-19 and influenza [20-22].
On the other hand, some symptoms such as vomiting, nasal congestion, runny nose and ocular symptoms are more common in influenza than in COVID-19 [21-23], while fatigue, neurological symptoms (such as headache) and gastrointestinal symptoms are more common in COVID-19 [22,24,25]. Given that the two diseases both overlap and differ in their symptoms, judgment relying on a single symptom is likely to be misleading and is of little value for early identification. Therefore, awareness of the symptom combinations of COVID-19 and of commonly accompanying symptoms may provide some information for distinguishing it from other respiratory infections such as influenza.
In addition, the multinomial logistic regression showed that, compared with younger people (0-12 years), those aged 13-44, 45-64 and ≥65 years had increased odds of developing symptom group 4. Previous studies suggest that immunosenescence and inflamm-aging may underlie this pattern [26,27]. Regarding comorbidities, patients with chronic diseases such as diabetes were more likely to show group 4 symptoms, which has also been reported previously [28]. In addition, for imported cases and cluster cases, the odds of symptom groups 2, 3 and 4 were all lower than for indigenous cases and non-cluster cases, respectively. For imported cases, entry quarantine may provide an explanation. As for the finding that non-cluster cases had more severe symptoms, a plausible explanation is that when infection occurs within the same family, work unit, nursery or school, an infected person is more likely to be identified as a close contact of another case in the cluster, and is therefore more likely to be detected at an early stage [26].
To interpret the finding that the prevalence of symptom group 4 (26.8%) was higher than that of group 2 (12.6%) and group 3 (10.0%), we considered the progression of symptoms over time. In the time interval analysis, the proportion of symptom group 2 decreased as the time interval lengthened, whereas the proportion of group 4 increased. This suggests a chronological order in the occurrence of different symptom patterns, which is consistent with previous studies examining the dynamic changes of symptoms. Joseph R. et al. [29] analyzed the symptoms of 55,924 confirmed cases with a Markov process and showed that there was a possible order in the development of COVID-19 symptoms: symptoms may begin with fever or cough, followed by upper respiratory symptoms such as sore throat, then fatigue and other systemic symptoms, and then gastrointestinal symptoms such as nausea, vomiting, diarrhea and abdominal pain. Huang Hanping et al. [30] analyzed the clinical characteristics of 305 patients in the early stage of the pandemic in Wuhan Jinyintan Hospital, China. They found that, compared with the early stages of disease, the incidence of cardiopulmonary symptoms increased significantly as the time interval lengthened. A similar pattern was also found in the work of Barak Mizrahi et al. [12]. These results indicate that longer intervals may mean a higher likelihood of cardiopulmonary symptoms such as dyspnea, systemic symptoms and/or gastrointestinal symptoms, which are exactly the group 4 symptoms in this article. Taken together, this suggests that the three symptom groups do occur in chronological order, with group 4 occurring later than groups 2 and 3.
Another notable finding was that the odds of symptom group 4 were higher in patients aged 45-64 years than in those aged ≥65 years. Despite immunosenescence and inflamm-aging [31], elderly people were not as likely to show more severe initial symptoms as expected. The influence of symptom progression cannot be neglected. Our results showed that more people aged 13-44 years and 45-64 years had longer time intervals, indicating a delay in seeking medical treatment in this population. Similarly, a study of 14,168 hospitalized infected people in Belgium found that the working-age group (aged 20-60 years) had longer intervals between symptom onset and their visits to a doctor [32]. In addition, that study observed shorter delays among elderly people in nursing homes.
Among the subjects in our study, the prevalence of underlying disease was higher in people aged ≥65 years than in younger people. One possibility is therefore that, because elderly people receive care for their underlying diseases, abnormal signs in this group are more likely to be noticed by caregivers, or to be detected in medical facilities because the patients were already in health care facilities when the pandemic occurred. In China, for example, the national office for the elderly also stressed care services for the elderly during the epidemic [33]. In contrast, 13-44 years and 45-64 years are working ages, so people in these groups may be less able to see a doctor in time because of work-related concerns. This may explain why middle-aged people were more likely than older people to delay medical visits and, as a result, had more severe symptoms when first diagnosed. Thus, considering this delay effect, this article suggests that middle-aged people may be a group needing attention in the prevention and control of the epidemic, from the perspective of early detection and early treatment. Measures such as publicity campaigns may be taken to improve the timeliness of medical treatment for the working-age population. In addition, employers may be able to relieve work-related worries by providing labor insurance benefits.
In contrast to many published studies that mainly described individual symptoms, this article focused on the associations between different symptoms and explored symptom cluster patterns. On this basis, we analyzed the population characteristics of each symptom group and also considered the influence of symptom development over time. We found that, apart from the asymptomatic group, the most frequent symptom combinations in the three groups with obvious symptoms were fever only, expectoration only and fatigue with fever, respectively. This can improve our knowledge of the symptoms of COVID-19 and provide some information for identifying infections. In addition, we found a chronological order in the occurrence of the three symptom groups, with symptom group 4 occurring later than symptom groups 2 and 3. Furthermore, this article revealed that people of working age were more likely to delay medical treatment and, as a result, had a higher proportion of symptom group 4.
Our work had several limitations. First, for comorbidities, information such as severity and duration was not collected, so the estimated effect of comorbidities may be biased by heterogeneity in disease severity and duration. Second, the time interval between symptom onset and the first visit may be subject to information bias because the time of symptom onset was self-reported. In addition, the cluster patterns found in this article are only a probable result, and more evidence is still needed in the future.

Conclusions
We focused on and identified different cluster patterns of COVID-19 symptoms and revealed the common symptom combinations under each pattern. These findings may be useful for identifying infected persons and for distinguishing COVID-19 from other epidemic diseases at an early stage. We also found that the middle-aged population may be a group requiring more attention during this epidemic, and measures are needed to improve the timeliness of medical treatment for this group.

Figure 1 The procedure of analysis in this article.

Figure 5 Time intervals between symptom onset and the first visit in different age groups.