Psychosocial Factor Identification in Cancer Patient Community


Background: The open Suomi24 discussion forum, Finland's largest topic-centric social media platform, provides excellent opportunities to study users' interactions with various contemporary discussion topics. Its easy access and anonymity enable patients and other users to seek valuable advice or share their experiences. This paper focuses primarily on the cancer-related health discussions in the forum and attempts to lay the foundations for a content-analysis-based approach for identifying psychosocial factors of the cancer patient community in Suomi24.
Method: The methodology relies to a large extent on a natural language processing approach that uses ontology construction to identify the relevant discussion threads in the Suomi24 corpus. Next, a co-occurrence analysis with various distance levels was employed to quantify the strength of the co-occurring words. Inspired by the analysis of psychosocial factors in related medical studies, a categorization of psychosocial factors has been put forward and investigated.
Results: The results overlap to a large extent with previous findings in clinical studies, where the dominance of social factors is widely reported. An open platform has been put forward to help both researchers and clinicians study the collected Suomi24 dataset, with the potential for correlation analysis with patient records through dedicated electronic medical record databases.
Conclusion: This work paves the way for further investigation of medical textual data in order to correlate the findings with patient records as registered in electronic medical records. Besides, the analysis also highlights the challenge of adapting standard NLP tools to the Finnish language, where more work is needed in order to match the performance of the tools available for English textual analysis.

Specifically, the Suomi24 dataset covers two decades of online discussion history and contains a corpus of over 2B tokens of Finnish discussion, with almost 832,000 unique users per week, each of whom spends on average 7 minutes on the forum at a time. This makes the Suomi24 dataset an excellent laboratory for combining linguistic characteristics of text and statistical features of the discussion dynamics with social science research questions, opening up unique perspectives for combining qualitative trends with quantitative dynamics of interaction, studying users' emotions and sentiment, or exploring various facets of conflicting evidence. Academically, this offers a clear advantage over other social media, such as Facebook and Twitter, whose complete set of data is normally unattainable for such critical examination. This motivates the current use of the dataset for a health-related analysis exploring the psychosocial factors of cancer patients.
Anonymous online discussions are known to provide an important arena for addressing and sharing health-related problems [12][13][14], where treatment efficiency, side effects, and personal experiences are very popular topics. Indeed, medicines and their dosages, side effects and availability are referred to, for example, when discussing the diagnosis and progress of diseases.
Besides, the criticism of health care practices is often associated with the limited, inadequate or excessive medication of people and overt medicalization of people's lives. Overall, anonymous social media data open up a perspective to peer-to-peer talk on what might be thought of as taboos in health issues, bypassing the standard professional setting of doctor-patient relation. From this perspective, talk about medication and associated lifestyle can be seen as part of the arena where people are conceptualizing and negotiating their health concerns together with the suitability of current practices and constraints.
Patients with life-threatening cancer are often diagnosed with anxiety and depression, which require special attention from the community as a whole, see, for instance, [15]. A forum is often an excellent venue for conveying and highlighting such support, where patients also attempt to broaden their knowledge of cancer and validate or refute information received from health care professionals. This occurred extensively in Suomi24 discussion threads as per prior investigations [16]. It includes emotional support through various expressions of care and concern that ultimately help to improve the patient's mood, as well as the sharing of experiences and new routes that may raise patient awareness, esteem and social network in a way that boosts confidence while fostering (new) relationships based on shared attributes. Loosely speaking, this also contributes, to some extent, to the empowerment of the online cancer patient community.
Research on blogs ranges from blog identification [17], identifying the source of re-publication of blogs [18], link analysis [19], and buzz analysis [20] to social network analysis [21]. A number of tools have also been developed to aid in visualizing all sorts of blog data, mainly with a focus on visualizing the connections between blogs and individual bloggers, the dynamics of buzz terms, topics or selected attributes over time, or the causation content of blogs, among others, see, e.g., [22].
Although there is a long history of analyzing, visualizing and exploring English online discussions with text mining and machine learning methods [23], the effective handling of the Suomi24 dataset is far from easy, and several challenges were identified. First, although the forum promotes thread-based discussion, it often occurs that health or cancer-related conversation takes place in non-health-related threads. Second, the anonymity and the absence of the concept of a user ID render any network-like analysis quite difficult and limited. Third, the use of standard natural language processing tools is quite limited for the Finnish language. Indeed, whereas in English the vocabulary size tends towards a log-curve as the size of the material grows, in Finnish it is typical that as the size of the data set grows, the number of word forms continues to grow linearly [24]. This leads to data sparsity, which is a problem for many machine learning methods that rely, for instance, on bag-of-words-like representations of data. Moreover, in colloquial texts, rife with misspellings, acronyms, and invented words, the usefulness of linguistic analysis is limited. Fourth, the use of open-source translation APIs often yields unsatisfactory responses due to their inherent limitations in both scope and constructs. Fifth, cancer is one of the most sensitive issues, which patients generally do not feel comfortable writing about in detail; this makes it harder for researchers to detect useful clues and to identify appropriate metrics relevant to clinicians.
The current study proposes a practical tool that can be employed to explore the Suomi24 dataset to identify potential patterns that are associated with health queries. In particular, psychosocial factors of cancer patients are investigated. Therefore, the main research questions highlighted in this article are: 1) How to identify the psychosocial factors from online discussion forums? 2) How to design an automated approach that identifies such factors in Suomi24?
The solution proposed in this paper tackles the above research questions, restricted to cancer patient discourse, using a combination of natural language processing techniques, statistical analysis and sound statistical observations. In particular, original methodologies have been devised and implemented in order to (i) collect the relevant dataset from the overall Suomi24 corpus, (ii) identify the main psychosocial factor categories, (iii) characterize each category through a specific ontology, (iv) evaluate the extent of each psychosocial factor in the corpus, and (v) interact with a wider community through a dedicated portal. Moreover, we believe that the developed approach can ultimately be employed to explore other socio-linguistic patterns in the Suomi24 dataset. The rest of the article is structured as follows. Section 2 details the data collection procedure, key features of the dataset and the preprocessing methodology. Section 3 deals with the data analysis, which includes the corpus analysis and identification of psychosocial factors. Section 4 highlights the platform implementation. Finally, we conclude the article and highlight future work.

[1] http://tnsmetrix.tns-gallup.fi/public/

Methods

Suomi24 and healthcare data
Suomi24 is a discussion forum that consists of publicly available, user-generated discussions grouped by content, such as entertainment, hobbies, travel, and health. Users are anonymous and can start their own discussions or contribute to existing ones.
In this study, the forum data was retrieved from a structured database hosted by the service provider. The license of the database, in compliance with copyright agreements by the World Intellectual Property Organization, grants the right to use and make copies of the corpus for educational, teaching and research purposes [25]. The dataset contains 352,725 posts in the Health category divided into 16 sub-categories, namely "ask your health questions", "birth control", "decease and mourning", "diseases", "drugs and addictions", "general health", "healthcare", "healthcare services", "medicines", "men's health", "mental health and wellbeing", "oral health", "plastic surgery", "senses (sensory organs)", "weight control", and "women's health". The distribution of the number of observations among the 16 categories is not uniform, see Table 1. Titles and posts contain 2.6 and 75.9 words on average, respectively. The median values of word counts are 2 for titles and 49 for posts, as highlighted in Table 1. We also tried to quantify the extent to which the posts contain unknown wordings. This is obtained by checking for the presence of each individual word in a Finnish dictionary or in one of the Finnish SMS databases (e.g., Slangi.net, Suomen murteiden sanakirja). If a word can be found neither in the Finnish dictionary nor in the two SMS databases mentioned above, it is classified as Unknown in Table 1.
The review of health-related discussion threads in Suomi24 shown in Table 1 reveals the following:
- There is a surge in discussions related to diseases and mental health/wellbeing, as indicated by the large number of posts in these threads.
- Users write at length when it comes to mental health issues, possibly to show their emotions and feelings while attempting to pay special attention to patient experience and argumentation. This also indicates a willingness to share experiences.
- The occurrence of unknown wording, although small, is found to be relatively higher in the Diseases and Medicines threads than in others. One explanation for this observation is the technical terminology employed for diseases and medicines, which, combined with users' voluntary or involuntary misspellings, results in unknown wordings.
- Despite its importance, "cancer" does not appear as a separate thread in the health discussion forum. Cancer-related discussion is therefore spread across multiple threads. For instance, cancer-related wordings are found in the Diseases, General Health, Healthcare, and Mental Health & Wellbeing threads. This does not mean that cancer-related wording cannot be found elsewhere. The absence of a dedicated thread highlights the importance of keyword-based search across the whole Suomi24 health dataset to identify key patterns that might be associated with cancer patients.

Data Collection
The first step in our methodology is to gather a cancer-related dataset from the Suomi24 forum. The forum provides an API to get forum posts from the database. An HTTP POST request with a "keyword" parameter was used to retrieve the corresponding discussions. For this purpose, we select a bag of keywords that includes: i. Cancer and its synonyms as gathered from an online Finnish dictionary. The format of the returned dataset is JSON, and the Python library "ijson" was used to loop through all the available data.
The procedure to select the research dataset is as follows. First, the entire health-related dataset pointed out in Sect. 2.1 is taken into account and considered as one source for cancer patient discussion. Second, we run a text search across the whole Suomi24 dataset for the above bag of keywords, and whenever a match of at least one keyword is found, the associated post, the name of the thread, the five posts prior to the matching post as well as the five following posts (provided they do not belong to a subsequent thread) are retrieved. Retrieving the prior and subsequent posts together with the underlying thread is motivated by the need to capture the context of the discussion. The overall schema is presented in Fig. 1.
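The selection rule described above (a keyword hit plus five posts of context on either side, never crossing a thread boundary) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `find_matches_with_context`, the in-memory `threads` dictionary and the plain substring match are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def find_matches_with_context(
    threads: Dict[str, List[str]],
    keywords: Iterable[str],
    window: int = 5,
) -> Dict[str, List[int]]:
    """Return, per thread, the sorted indices of posts to retrieve.

    `threads` maps a thread name to its posts in chronological order
    (hypothetical layout). A post is kept when it contains a keyword or
    lies within `window` posts of such a post in the same thread.
    """
    lowered = [k.lower() for k in keywords]
    selected: Dict[str, List[int]] = {}
    for name, posts in threads.items():
        keep = set()
        for i, text in enumerate(posts):
            body = text.lower()
            if any(k in body for k in lowered):
                # Clamp the context window to the thread boundaries.
                start = max(0, i - window)
                end = min(len(posts), i + window + 1)
                keep.update(range(start, end))
        if keep:
            selected[name] = sorted(keep)
    return selected
```

Because the window is clamped per thread, context posts are never taken from a neighboring thread, which mirrors the proviso stated in the text.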

Data Preprocessing
The preprocessing task aims to identify the boundaries of individual posts and their various constituents (tokens) taking into account a large number of potentially misleading content and writing styles that may cause unsatisfactory segmentation.
This includes, for instance, the presence of links, uncommon characters, and misplaced spaces or punctuation characters, which can impact further analysis. For this purpose, for each post in the cancer dataset, the following tasks have been performed.
i. All text is converted to lowercase letters/characters;
ii. Whitespace is added after each dot if not already present;
iii. Any number of consecutive whitespace characters are transformed into a single whitespace character;
iv. All URLs are removed together with any uncommon characters (e.g., Chinese and Greek characters). Odd characters are also removed using regex functions;
v. Stopwords are removed to reduce the list of keywords. The stopword list is borrowed from UniNE of Neuchâtel University, after manual checking to keep all adjectives, time/date information and negation-related terms due to their importance in subsequent reasoning;
vi. Tokenization is carried out to identify the individual tokens of the posts. The classical whitespace delimiter does not work well for the Finnish language because of its inherent structure, which combines individual words to create longer words. Therefore, the contribution of a Finnish language parser is important for distinguishing individual words. For this purpose, we use the Finnish-deep-parser developed at Turku University;
vii. A simple stemming approach is adopted. This aims to identify the root/stem of a word to reduce the number of word inflections, which, in turn, reduces the total number of keywords/tokens used to represent the sentence (post). For this purpose, we employed FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish.
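Steps i to v above can be illustrated with a short sketch based on Python's standard `re` module. The regular expressions, the set of characters treated as "common", and the function name `normalize_post` are assumptions for illustration only; steps vi and vii rely on the Finnish-deep-parser and FinnPos and are not reproduced here.

```python
import re

# Assumed patterns; the paper does not give the exact expressions used.
URL_RE = re.compile(r"(https?://|www\.)\S+")
# Keep ASCII letters/digits, the Finnish letters å ä ö, whitespace and
# basic punctuation; everything else counts as an "odd" character.
ODD_CHAR_RE = re.compile(r"[^a-z0-9åäö\s.,!?:;()'-]")
# A dot directly followed by a non-space, non-dot character.
DOT_RE = re.compile(r"\.(?=[^\s.])")
SPACE_RE = re.compile(r"\s+")


def normalize_post(text: str, stopwords: frozenset = frozenset()) -> str:
    text = text.lower()                     # step i
    text = URL_RE.sub(" ", text)            # step iv: URLs (before dot handling)
    text = ODD_CHAR_RE.sub(" ", text)       # step iv: uncommon characters
    text = DOT_RE.sub(". ", text)           # step ii
    text = SPACE_RE.sub(" ", text).strip()  # step iii
    # Step v: whitespace tokens only; real tokenization is left to the parser.
    tokens = [t for t in text.split(" ") if t and t not in stopwords]
    return " ".join(tokens)
```

URLs are removed before the dot rule is applied so that dots inside links are not split into separate fragments; note that this simple dot rule would also split decimal numbers such as "2.5".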

Data Analysis
Initial Analysis
To understand the content of the gathered dataset, we report the distribution of the most frequent words in the filtered corpus and the frequency of the most frequent cancer terms, in order to visualize how close the two sets of frequencies are. Algorithm 1 described in Fig. 1 was used for the term-frequency analysis. A summary of the attributes of the dataset is given in Table 2.
The results are presented in Table 3.
Commenting on the findings, one should notice the following:
- The results were obtained after removing the stopwords, which discards the terms that are less useful for the study.
- For clarity of illustration, we restricted the analysis to the twelve most frequent terms in each category (non-cancer-related terms and cancer-related terms). This is also motivated by the fact that these terms constitute almost half of the terms that have at least three occurrences in the identified sample.
- The majority of non-cancer terms are related to affection and to various types of help (sharing experience, finance, advice).
- The graph also highlights the quasi-dominance of the multi-form ("saa"), which expresses the action of "getting", due to its wide usage in Finnish.
- The cancer-related terms are, as expected, less frequent than non-cancer-related terms by a factor of at least two. Besides, the treatment-related terms are even less frequent, which is explained by the context employed to generate discussions around cancer-related terms.
- The cancer discourse highlights the importance of financial aspects, treatment effects, death, disease, and medication.
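The term-frequency analysis discussed in this subsection can be approximated with a few lines of Python. The sketch below is not Algorithm 1 itself; it assumes posts are already tokenized and lemmatized, and that a set of cancer-related terms (`cancer_terms`, hypothetical) is available to split the ranking into the two groups.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def top_terms(
    posts: Iterable[List[str]],
    cancer_terms: Set[str],
    n: int = 12,
) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Return the n most frequent non-cancer terms and cancer terms."""
    counts = Counter(tok for post in posts for tok in post)
    ranked = counts.most_common()  # sorted by decreasing frequency
    other = [(t, c) for t, c in ranked if t not in cancer_terms][:n]
    cancer = [(t, c) for t, c in ranked if t in cancer_terms][:n]
    return other, cancer
```

The default `n=12` follows the restriction to the twelve most frequent terms per group mentioned above.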

Evidence mapping
Co-occurrence analysis
The first stage in evidence mapping is to explore the most frequent co-occurrences in the filtered dataset as an indication of relatedness between the terms. In other words, we hypothesize that the higher the co-occurrence between two terms, the stronger the association between these terms.
Further development incorporated the functionality to distinguish how far apart the co-occurring terms are from each other. This reflects the hypothesis that co-occurring terms that occur close to each other in the sentence or post are more important than terms that co-occur far apart from each other. In addition, to take into account the syntactic relationship among the co-occurring terms, a dependency tree was used; namely, the Turku Dependency Treebank was employed to deal with the Finnish language. An example of the use of the Turku Dependency tree is shown in Fig. 3.
The use of a parse tree allows us to capture the relationship between words even if they are located far apart from each other. For instance, a verb and its subject, as well as its complement, would entail a direct link from the verb to its subject and from the verb to its complement. Therefore, we hypothesize that whenever two words, say, $w_1$ and $w_2$, are directly linked to each other in the parse tree, the distance between these two words is equal to one even if they are far apart from each other; otherwise, the distance is calculated as the number of tokens that separate $w_1$ and $w_2$. More formally, this can be formulated as follows:

$$d(w_1, w_2) = \begin{cases} 1 & \text{if } w_1 \text{ and } w_2 \text{ are directly linked in the parse tree,} \\ \text{number of tokens separating } w_1 \text{ and } w_2 & \text{otherwise.} \end{cases}$$

From an implementation perspective, we restricted the analysis to five different distance levels: 1-2 words distance, meaning next to each other (either directly or through the dependency tree structure), which indicates a strong influence on each other; 3-5 words distance, indicating a potentially significant impact on each other; and 6-8 words apart, where the impact is likely to be medium. Finally, the level 9-14 words (apart from each other) corresponds to a rather minor to low impact on each other, and the 15+ distance (15 or more words apart) implies that the co-occurrence is considered to have no impact on the syntactic, semantic or causal relationship between the two terms $w_1$ and $w_2$. Algorithm 2 (see Fig. 4) highlights the implementation of the distance-based co-occurrence paradigm. Figure 5 highlights the word co-occurrences taking into account the various distances, while Table 4 reports the top co-occurring pairs corresponding to general discussion and cancer-related terms.
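To make the distance-based counting concrete, the following sketch bins every pair of tokens in a post into one of the five distance levels. It is an illustration under stated assumptions rather than Algorithm 2: the distance between unlinked words is taken as their difference in token position (adjacent words have distance 1), the middle band is taken as 6-8 so that no distance is left unassigned, and dependency links are passed in as index pairs (`dep_links`, a hypothetical input that a parser would provide).

```python
from collections import Counter
from itertools import combinations
from typing import FrozenSet, List, Set

# Assumed band edges (lower bound, upper bound, label).
LEVELS = [(1, 2, "1-2"), (3, 5, "3-5"), (6, 8, "6-8"), (9, 14, "9-14")]


def level_of(distance: int) -> str:
    for low, high, label in LEVELS:
        if low <= distance <= high:
            return label
    return "15+"


def cooccurrences(
    tokens: List[str],
    dep_links: Set[FrozenSet[int]] = frozenset(),
) -> Counter:
    """Count (word pair, distance level) occurrences within one post."""
    counts: Counter = Counter()
    for i, j in combinations(range(len(tokens)), 2):
        # A direct dependency link overrides the positional distance.
        d = 1 if frozenset((i, j)) in dep_links else j - i
        pair = tuple(sorted((tokens[i], tokens[j])))
        counts[(pair, level_of(d))] += 1
    return counts
```

Summing the counters returned for all posts gives corpus-level frequencies per pair and distance level, which is the kind of table the comparison across levels relies on.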
A quick reading of the results reveals the following:
i. Scrutiny of the highly co-occurring words outside the cancer-related discourse shows the dominance of explanatory expressions, as in (jostain-syystä), which translates to (for some reason), and of timing expressions, as in (pari-viikkoa) and (viikon-päästä), which translate to (couple of weeks) and (in a week), respectively. Conditional expressions can also be noticed, as in (pitäsi-tehda), translating to (would have to - make). Recommendation-like expressions are highlighted in the pair (ottaa -yhteytt), translating to (take - contact).
ii. Despite the quasi-dominance of the Finnish language, English terminology is not absent: many English words occur, which explains the high frequency of the determiner (the) and the conjunction (and).
iii. There is a single co-occurrence that is more closely related to the clinical setting, (lääkäri -sanoi), a translation of (a doctor - said), which indicates the strong interest of community users in seeking doctors' advice and recommendations drawn from the experience of others.
iv. Scrutiny of the cancer-related discourse indicates the dominance of the financial aspect, where users/patients ask about treatment costs either directly or indirectly, as in (maksaa-eur). The next co-occurring wordings are related to the type and effectiveness of the treatment (see all pairs with "hoidot"). One also notices the emergence of family-related issues in (äiti-isä), translating to the (mother-father) relationship.
v. The preceding highlights the importance of treatment effects, financial cost and family issues as dominant factors in the cancer discourse, whereas the non-cancer discourse shows the importance of clinical advice in the Suomi24 health community, with users seeking doctors' opinions and confirmation. In addition, the dominance of conditional and recommendation expressions indicates the importance of argumentation structures and storytelling scenarios, where users attempt to build arguments to respond to a patient's or user's worry, or share a personal experience.
vi. The analysis of the results of Fig. 5 shows a large disparity in frequency values between the terms co-occurring at distance 1 (or occasionally, distance 2) and those co-occurring at higher distances (3-5, 6-8, 9-14, 15+), by a factor of more than 100. This indicates that the co-occurrences occurring at a distance of more than 2 are statistically insignificant when compared to those occurring at distance 1 or 2.

Rationale
In this study, the psychosocial factors are categorized into five categories: Financial, Illness, Treatment, Social and Death. The motivation for such a categorization is based on commonsense reasoning, a literature study and manual analysis of the data.
First, intuitively, cancer patients foresee death as the ultimate end, either in the short or medium term. Besides, the type of illness, the effectiveness of treatment, its associated cost (both clinical cost and family-related cost) as well as the patient's social life are naturally of paramount importance to any cancer patient. Second, previous studies reported the clinical and psychosocial factors associated with anxiety and depressive disorders in breast cancer patients and other cancer patients [26]. The authors identified three types of associated psychosocial factors: social support, family relationship, and conflict solving. Anxiety and depression in breast cancer patients were common in environments of poor family relationships, which also contributed to more pain and fatigue [27]. This stresses the importance of social factors as part of the psychosocial factors for cancer patients. Studies in [28] suggested that more social support was associated with lower cancer incidence and mortality, and with longer disease-free interval and survival, which points to the importance of the treatment, illness and death factors. Similarly, the authors in [29] found that the number of social ties and experienced social isolation were related to cancer incidence and mortality from cancer. Mofatt et al. [30] found that the onset, treatment, and trajectory of cancer are associated with financial stress among patients, which has been identified as a significant unmet need. This stresses the importance of financial factors as part of the overall umbrella of psychosocial factors. Third, manual scrutiny of many Suomi24 posts revealed that potential treatment success and failure (death) appear among the hot discussion topics.
Similarly, illness and treatment are usually associated with cancer conversations, where people usually seek education, peer support or advice to cope with their disease and its associated challenges [31]. Time costs and out-of-pocket expenditures demonstrate the sizeable financial burden that patients usually endure, often with limited resources, relying on family members or friends [32], which supports the role of financial issues as one of the main psychosocial factors. Fourth, the term frequency and co-occurrence analysis carried out in the previous section supports the above categorization of the psychosocial factors. Indeed, a simple term frequency analysis of the cancer discourse reveals the dominance of financial aspects, treatment effects, death and disease (or illness) related issues, whereas the non-cancer discourse highlights the dominance of social factors. Family relationships, for instance, were among the top co-occurring words in Fig. 5. Besides, the abundance of argumentation and storytelling expressions, which help to share stories and build convincing statements in a way that answers patient worries, testifies to the significance of psychological/social support.

Ontology characterization of categories
Inspired by techniques employed in some hate-speech detection research (see, e.g., the review paper of Schmidt and Wiegand [33]), the idea employed for building the ontology related to each category is the following. First, we start with a limited number of manually crafted words assigned to each category based on our understanding of the linguistic and contextual meaning of each category. Second, we perform a manual investigation of a random selection of around 1000 posts of the collected dataset. The aim of this investigation is to refine the manually crafted words of each category by amending or adding extra wordings that we found relevant to a given category after reading the post(s); it often occurs that many terms identified through post review are rather loosely related, linguistically and semantically, to the actual meaning of the category. Moreover, given that several homonyms result from the manual crafting and random post scrutiny phase, we revisit each homonym and assign the corresponding sense (synset) that best matches the underlying category. We used FinnWordNet, provided by the University of Helsinki as part of the FIN-CLARIN project (see FinnWordNet [34]), to generate the associated synsets. This operation is performed independently by three native Finnish speakers in order to eliminate any bias or conflict and to avoid missing interesting event(s). Third, using the (Finnish) WordNet lexical database, an automated system was created to add, for each word of a given category, its set of synonyms, antonyms, hyponyms, and hypernyms whenever available. If a newly constructed word (synonym, antonym, hyponym or hypernym) is found to belong to more than one category, it is systematically removed. Such reasoning, although intuitive and innovative in its approach to generating a reasonably large collection of related words for each category, also has its limitations, as will be highlighted later on.
The overall architecture of the ontology building is shown in Fig. 6.
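The automated third step (expansion through lexical relations, followed by removal of words that fall into more than one category) can be sketched as below. The `related` dictionary is a hypothetical stand-in for FinnWordNet lookups of synonyms, antonyms, hyponyms and hypernyms; the function name and the toy Finnish words are illustrative assumptions only.

```python
from typing import Dict, Set


def expand_ontology(
    seeds: Dict[str, Set[str]],
    related: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Expand each category's seed words, dropping ambiguous additions."""
    # Candidate words generated for each category from its seed words.
    candidates = {
        cat: set().union(*(related.get(w, set()) for w in words))
        for cat, words in seeds.items()
    }
    expanded = {cat: set(words) for cat, words in seeds.items()}
    for cat, new_words in candidates.items():
        for word in new_words:
            owners = [
                c for c in seeds
                if word in candidates[c] or word in seeds[c]
            ]
            # Keep the new word only if it belongs to this category alone.
            if owners == [cat]:
                expanded[cat].add(word)
    return expanded
```

Manually crafted seed words are always kept; only the automatically generated words are subject to the cross-category removal rule, which is one reading of the procedure described above.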

Evaluation of categories in collected dataset
After generating the ontology associated with each psychosocial factor category, we would like to evaluate the occurrence of each category in the overall collected dataset.
For this purpose, since the ontology of each category is made of a set of terms generated through the process of Fig. 6, a simple string matching would enable us to determine the extent of the hits in the overall dataset. Nevertheless, special care should be taken with homonym terms, which are represented by their synsets. In such cases, an extra step is taken to find out whether there is a matching of senses as well. Consequently, there is a need for a mechanism to identify the correct synset of a given homonym word in the post. For this purpose, a novel mechanism is designed that makes use of a semantic similarity approach. More specifically, consider a post $P$, tokenized as $P = \langle T_1, T_2, \ldots, T_m \rangle$, and assume, without loss of generality, that $T_2$ is the target word for which one seeks to determine the associated synset.
A blind approach, which does not fully explore the whole syntactic relationship of the underlying sentence containing $T_2$ but is computationally appealing, consists of (i) calculating the semantic similarity of $T_2$ with the first term situated next to it for which such a semantic similarity exists, using, for instance, the Wu and Palmer WordNet similarity ($Sim_{WP}$) [35], and then (ii) retrieving the synset of $T_2$ that yields such a similarity score. In other words, this boils down to the following:
If $Sim_{WP}(T_1, T_2)$ exists then retrieve synset($T_2$)
else If $Sim_{WP}(T_2, T_3)$ exists then retrieve synset($T_2$)
else If $Sim_{WP}(T_2, T_4)$ exists then retrieve synset($T_2$)
Etc.
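The nearest-neighbor sense selection just described can be illustrated on a toy taxonomy. Everything in the sketch is hypothetical: the tiny `PARENT` hierarchy and `SENSES` table stand in for FinnWordNet, the Finnish homonym "kuusi" (the number six or a spruce tree) is chosen only as an example, and the Wu and Palmer score is computed by hand as twice the depth of the deepest common ancestor divided by the sum of the two synset depths.

```python
from typing import List, Optional

# Hypothetical toy taxonomy: synset -> hypernym (None marks the root).
PARENT = {
    "entity": None,
    "number": "entity",
    "plant": "entity",
    "kuusi.number": "number",
    "viisi.number": "number",
    "kuusi.tree": "plant",
    "mänty.tree": "plant",
}
# Hypothetical sense inventory: word -> candidate synsets.
SENSES = {
    "kuusi": ["kuusi.number", "kuusi.tree"],
    "viisi": ["viisi.number"],
    "mänty": ["mänty.tree"],
}


def path_to_root(synset: str) -> List[str]:
    path = []
    while synset is not None:
        path.append(synset)
        synset = PARENT[synset]
    return path


def wup(a: str, b: str) -> float:
    """Wu and Palmer similarity on the toy tree (root has depth 1)."""
    ancestors_b = set(path_to_root(b))
    # First shared node when walking up from a is the deepest one.
    lcs = next(s for s in path_to_root(a) if s in ancestors_b)
    depth = lambda s: len(path_to_root(s))
    return 2 * depth(lcs) / (depth(a) + depth(b))


def best_sense(tokens: List[str], target_idx: int) -> Optional[str]:
    target = tokens[target_idx]
    if target not in SENSES:
        return None
    # Visit neighbors by increasing distance, previous token first.
    order = sorted(
        (i for i in range(len(tokens)) if i != target_idx),
        key=lambda i: (abs(i - target_idx), i),
    )
    for i in order:
        if tokens[i] in SENSES:  # a similarity score "exists"
            scored = [
                (wup(s, t), s)
                for s in SENSES[target]
                for t in SENSES[tokens[i]]
            ]
            return max(scored)[1]
    return None
```

With a real lexical database, the inner maximization over sense pairs would be delegated to the library's similarity function; the toy version only makes the control flow visible.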
The intuition behind such reasoning is that the WordNet semantic similarity score between two words $T_i$ and $T_j$ always outputs the maximum score that can be achieved when considering all combinations of synsets of $T_i$ and $T_j$. Therefore, the retrieved synset corresponds to the sense that agrees most with the neighboring word in the post. This captures to a large extent the correct sense of the target word in the underlying sentence. The algorithmic implementation of the preceding is described in Algorithm 3.
[Algorithm 3, surviving fragment: Else k = k + 1; GOTO 0; END]
Next, once the process of matching the homonym words to the post is achieved, the overall evaluation turns into a simple string matching process that measures the proportion of occurrences of each factor in the whole collected dataset. This is performed using a simple word-hit-based methodology. More formally, given the $i$-th category of the psychosocial factors (Illness, Treatment, Social, Financial and Death), the associated proportion $p_i$ is calculated as:

$$p_i = c \sum_{j} \left| \mathrm{Post}_j \cap C_i \right|$$

where $c$ stands for a normalization factor, $\mathrm{Post}_j$ corresponds to the set of tokens associated with the $j$-th post of the dataset, and $C_i$ refers to all wordings (tokens) that can be associated with the $i$-th psychosocial category. The sum in the above expression is performed over all the posts contained in the dataset.
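A word-hit proportion of this kind can be computed as in the sketch below. It assumes each post is treated as a set of tokens and, since the text does not specify the normalization factor, that the factor is the reciprocal of the total number of hits over all categories, so that the proportions sum to one. The function name and the toy data in the usage note are illustrative only.

```python
from typing import Dict, Iterable, List, Set


def category_proportions(
    posts: Iterable[List[str]],
    ontology: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Share of ontology word hits per psychosocial category."""
    post_sets = [set(p) for p in posts]
    hits = {
        cat: sum(len(tokens & terms) for tokens in post_sets)
        for cat, terms in ontology.items()
    }
    total = sum(hits.values())
    if total == 0:
        return {cat: 0.0 for cat in hits}
    # Assumed normalization: divide by the total number of hits.
    return {cat: h / total for cat, h in hits.items()}
```

For instance, with three toy posts `[["äiti", "raha", "hoito"], ["äiti", "isä"], ["raha"]]` and one-to-two-word categories for Social, Financial and Treatment, the Social category collects three of the six hits.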

Psychological factor categorization
The strength of each of the five psychosocial factors, using the evaluations highlighted in expressions (2-3), is illustrated in Fig. 7.
The results show the dominance of the social factors, followed by financial and treatment-related issues. Intuitively, cancer patients experience many worries, e.g., how their spouses or children will manage after the onset of their cancer, whereas a financial burden may arise because of the fear of salary loss due to illness, or of a potential salary cut. This is also in agreement with other research findings. For instance, McFarland and Holland [37] and Falisi [38] stress the tendency of cancer patients to build a social life that overcomes the psychological effects of the treatment and illness, whereas the financial burden arises naturally as an implicit cost of either treatment or family-related charges.

Portal implementation
To enable a flexible solution that explores the Suomi24 dataset while providing a satisfactory user experience, we implemented a Web application supported by a back-end database. Apache HTTP Server was used as the web server, whereas Firebase was selected for the back-end database. The application was written mainly in JavaScript with the React library. The use of a web application allows easy access and enables both clinicians and academic researchers to test various research hypotheses and gain important insights regarding the behavior of cancer patients as indicated by the Suomi24 discussion forum. Figure 8 shows a user interface of the Web application with a query, where "Terveys", which means "Health", is one of the subtopics of the Suomi24 discussions. The query result is shown in the same plot, where the co-occurrences and all the subtopics that appear in the detected discussions are presented, together with a count of how many times each occurred.

Visualization of effects of thread and psychosocial factors
Besides, an interface has also been designed in order to visualize, for each pair of (highly frequently) co-occurring terms, the effects of other metadata of the Suomi24 dataset. This includes, for instance, the discussion threads (topics) where these terms co-occurred, the variation of word distance between the co-occurring terms, and the psychosocial factors identified across the various posts where the terms co-occurred. A snapshot from Firebase with one example is illustrated in Fig. 9. Other functionalities include showing whether the co-occurring terms occurred in the title of the thread or sub-threads or only in the body of the post, the time slot, and whether the username of the author is available.

Discussion
The above analysis, although appealing due to its full autonomy and its ability to accommodate expert knowledge through an ontology description of the various health-related concepts investigated throughout this study, is not free from inherent limitations.
First, as with most data analytics techniques, the outcome is affected by intrinsic limitations of text mining. These include (i) the effects of word normalization, which may eliminate important elements of the sentence construct and make the parser output incorrect; (ii) the effects of compound words, which the parser sometimes fails to identify as such, directly degrading the performance of the information retrieval task; and (iii) the existence of multilingual expressions, as English and Swedish expressions are often found in the Suomi24 dataset, which sometimes makes the parser output unreliable.
Second, handling the Finnish language presents an extra difficulty in the sense that existing tools, including parsers, word normalization tools, and FinnWordNet, are still unstable compared to English language tools. This directly affects the accuracy of the results as well. For instance, the output of the word-sense disambiguation of homonyms is subject to the instability of FinnWordNet and the subsequent calculation of the semantic similarity score. Nevertheless, it should be noted that homonymy affects only a tiny fraction of the category ontology words, which restricts the impact of this limitation on the evaluation results.
Third, the initial manual crafting of the ontology associated with each category is inevitably subject to the subjective judgment of the authors in charge of manual labeling, according to their understanding of the scope of each category.
Besides, the inherent vagueness of human language interpretation makes it very difficult to avoid overlapping wording that can naturally belong to more than one category.
Fourth, the recourse to a random selection of posts to add to or amend the initial crafting is subject to the random nature of those posts. Although manual reading of posts enabled us to discover new jargon expressions employed by the users, which would have been impossible to infer using solely simple linguistic interpretations of the scope of the categories, the coverage of such jargon wording is far from exhaustive. Indeed, many important wordings originating from local jargon describing an aspect of finance or social factors, for instance, might be overlooked. It is therefore not excluded that additional manual scrutiny of the posts would result in further expansion of the category ontology.
Nevertheless, we acknowledge that, unlike in microblogging, such jargon is not that abundant in Suomi24. This is also in agreement with other related studies involving the Suomi24 corpus as part of the Academy of Finland Citizen Mindscapes project [38].
Fifth, the results of the first and second phases of the ontology construction reveal the existence of several wordings that belong to the same root word but are assigned to distinct categories. Accordingly, such wordings would induce the same hyponyms and hypernyms, which would result in a cancellation of these words as part of the overall ontology construction. Likewise, regardless of whether the above words are inflections of the same word or not, the use of hyponymy and hypernymy relationships may also lead to words that are already assigned to other categories.
Sixth, the idea of systematically removing words that belong to distinct categories as part of the design methodology can also be questioned. Indeed, although this is expected to reduce the potential overlap among these categories, which in turn increases their distinguishability, it can lead to an underestimation of their occurrences. Loosely speaking, the effect of such underestimation is not important, because the core aim is to comprehend the extent of variation among the occurrences of these categories in the corpus rather than to estimate accurately the occurrence of each individual category.
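The trade-off raised in the sixth point can be illustrated with a small sketch. It assumes, purely for illustration, that the category ontology is a mapping from category name to a set of words; the toy ontology, the token list, and the function names are hypothetical and not taken from the study.

```python
from collections import Counter


def remove_shared_words(ontology):
    """Drop every word that appears in more than one category."""
    membership = Counter(w for words in ontology.values() for w in set(words))
    return {
        cat: {w for w in words if membership[w] == 1}
        for cat, words in ontology.items()
    }


def category_counts(tokens, ontology):
    """Count how many tokens fall into each category's word set."""
    return {
        cat: sum(1 for t in tokens if t in words)
        for cat, words in ontology.items()
    }


# Toy ontology: "tuki" (support) plausibly belongs to both categories.
ONTOLOGY = {
    "social": {"perhe", "lapsi", "tuki"},
    "financial": {"raha", "palkka", "tuki"},
}
TOKENS = "perhe tuki raha tuki palkka".split()

before = category_counts(TOKENS, ONTOLOGY)
after = category_counts(TOKENS, remove_shared_words(ONTOLOGY))
```

In this toy case the counts drop from 3 and 4 to 1 and 2 once the shared word is removed, so each category is underestimated while their relative ordering is unchanged. Whether the ordering is preserved on real data depends on how the shared words are distributed across categories.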

Conclusion
This paper provides an open platform for analyzing Suomi24 health data in order to identify psychosocial factors associated with cancer patients. The methodology makes use of natural language processing and data fusion techniques that rely on co-occurrence analysis at various distance levels to quantify the strength of the co-occurring words. In addition, inspired by psychosocial factor analysis in related medical studies, a categorization of psychosocial factors has been put forward and investigated. The results reinforce some previous findings in clinical studies regarding the dominance of social factors. An open platform has been put forward to help both researchers and clinicians. The work paves the way for further investigation of medical textual data in order to correlate the findings with patient records as registered in electronic medical records. Besides, the analysis also highlights the challenge of adapting standard NLP tools to the Finnish language, where more work is needed to match the performance of the tools available for English text. In particular, the concepts of textual argument and text summarization, in continuation of our work in [39][40], can be employed to ensure some inherent consistency and coherence in the textual outputs.

Declarations
Ethics approval and consent to participate
This study was approved by the institutional review board of the University of Oulu Hospital (UO15 2017-2019).

Consent for publication
Not applicable.

Availability of data and materials
Data generated or analyzed during this study are included in this published article and its supplementary information files with either appropriate links or Excel data sheets.

Competing interests
The authors declare that they have no competing interests.

[Figure: Overall approach for data collection]
[Figure: Co-occurrence detection and topic modeling]

Figure 5
Histogram of most co-occurring terms and comparison with cancer-related terms with distance

Figure 6
Overall architecture of psychosocial category ontology generation

[Figure: Sample output on database (Firebase)]
