Sentiment Analysis of User Feedback on the HSE Contact Tracing App

Digital Contact Tracing (DCT) is seen as a key tool in reducing the propagation of viruses such as Covid-19, but it requires uptake and participation in the technology across a large proportion of the population to be effective. While we observe the pervasive uptake of mobile device usage across our society, the installation and usage of contact tracing apps developed by governments and health services globally have faced difficulties. These difficulties range across the user populations' issues with the installation of the apps, usability and comprehension challenges, trust in the efficacy of the technology, performance issues such as the effect on battery life, and concerns about data protection and privacy. In this work, we present our findings from a comprehensive review of the online user feedback relating to the initial release of the HSE Contact Tracker app, in an effort to inform later iterations and thus help sustain and potentially increase usage of the technology. While this might seem quite tightly scoped to the Irish context only, this app provides the basis for apps in many jurisdictions in the United States and Europe. Our findings suggest a largely positive sentiment towards the app, and that users thought it handled data protection and transparency aspects well. But feedback also suggested that users would appreciate more targeted feedback and more proactive engagement, and that the ‘android-battery’ issue and the backward-compatibility issue with iPhones seriously impacted retention and uptake of the app respectively.


Introduction
As the coronavirus (Covid-19) continues to spread globally, governments and public health institutions look to contact tracing to help isolate and contain outbreaks. The more traditional manual contact tracing approach initially adopted in Ireland is both time and resource intensive, and may struggle to identify all contacts quickly enough, before they cause further transmission. In contrast, the efficiency and responsiveness of a digital approach using the proximity sensors in smartphone devices has the potential to limit delay and catch a greater number of contacts [1].
Key to the effectiveness of these digital solutions, however, is the percentage take-up of the app across the population, with one study in the UK [2] suggesting that the epidemic could be suppressed if 80 percent of all smartphone users used the app. This equates to 56 percent of the population overall.
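The two percentages quoted above imply a smartphone-ownership rate, which can be recovered with one line of arithmetic. The following is a minimal sketch; the variable names are illustrative, and the ownership rate is derived here from the two quoted figures rather than taken from [2].

```python
# Figures quoted in the text (attributed to the UK study [2]).
app_share_of_smartphone_users = 0.80  # 80 percent of smartphone users
app_share_of_population = 0.56        # 56 percent of the whole population

# Implied share of the population owning a smartphone.
# This value is derived from the two figures above, not quoted in the text.
implied_smartphone_ownership = (
    app_share_of_population / app_share_of_smartphone_users
)

print(f"Implied smartphone ownership: {implied_smartphone_ownership:.0%}")
```

In other words, the two figures are consistent with roughly seven in ten people owning a smartphone.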

Motivation
Analysis of the public's response to the initial release of the HSE Contact Tracker app [3] can help guide the system's evolution towards maintaining, and expanding, the uptake and ongoing use of the app to fight Covid-19 in Ireland. It can do so by informing on user refinements to aspects like its usability, functional effectiveness, and performance. The voluntary nature of participation in the use of the app, combined with the requirement for a critical mass of users across the country to make the app effective, makes such feedback a crucial tool in the campaign to defeat the spread of the virus.
To that end, this research [4] [5] analysed all app reviews of the HSE app on the AppStore [6] and Google Play [7] using seven different aspects of interest: General Characteristics, Usability, Functional Effectiveness, Performance, Data Protection, Autonomy (of users), and Overall (generic comments). This analysis focused on 'positive' and 'negative' sentiment expressed by the user under each of these aspects, in order to build on areas well received, and to target areas where future releases of the app could be refined.

HSE App Overview
Ireland's Health Service Executive released the COVID Tracker app (see Fig 1), developed by Nearform, across the Apple and Google online app stores in early July 2020.
Built on the Google and Apple Exposure Notification API (GAEN), it uses Bluetooth and anonymous IDs to log any other phone with the app it is in close contact with, tracking the distance and the time elapsed. Every 2 hours the app downloads a list of anonymous IDs that have been shared with the HSE by other users that have tested positive for Covid-19.
If a user has been closer than 2 metres for more than 15 minutes with any of these phones they will get an alert that they are a close contact. The app runs in the background. Beyond the core contact tracing technology lies additional voluntary self-reporting functionality: users can choose to log daily health status or symptoms via the Check-In option, and also to share their age group, sex and locality. Also optional is the ability to share a contact phone number so the HSE can contact them.
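The close-contact rule described above (a downloaded positive ID, closer than 2 metres, for more than 15 minutes) can be sketched as a simple filter. This is an illustration only, not the app's code: the `Encounter` record, the function names and the example IDs are hypothetical, and the real exposure check is performed through the Exposure Notification framework, which estimates proximity from Bluetooth signals rather than measuring distance directly.

```python
from dataclasses import dataclass

# Thresholds as described in the text; the constant names are hypothetical.
MAX_DISTANCE_METRES = 2.0
MIN_DURATION_MINUTES = 15.0


@dataclass
class Encounter:
    """One logged encounter with another phone (hypothetical record)."""
    anonymous_id: str
    distance_metres: float
    duration_minutes: float


def is_close_contact(encounter, positive_ids):
    """True if the encounter matches a positive ID and both thresholds."""
    return (
        encounter.anonymous_id in positive_ids
        and encounter.distance_metres < MAX_DISTANCE_METRES
        and encounter.duration_minutes > MIN_DURATION_MINUTES
    )


def close_contacts(logged, positive_ids):
    """Filter the local encounter log against the downloaded positive IDs."""
    return [e for e in logged if is_close_contact(e, positive_ids)]


# Illustrative data: only the first encounter satisfies all three conditions.
log = [
    Encounter("id-a", 1.2, 20.0),
    Encounter("id-b", 1.0, 5.0),    # too short
    Encounter("id-c", 3.5, 40.0),   # too far away
    Encounter("id-d", 1.0, 30.0),   # ID not reported as positive
]
alerts = close_contacts(log, positive_ids={"id-a", "id-b", "id-c"})
```

Note that the matching happens on the phone against the downloaded list, which is why the encounter log itself never needs to leave the device in this sketch.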

Pillar Derivation
The analysis process in this research involved coding user-reviews into 7 aspects, henceforth called pillars: General Characteristics, Usability, Functional Effectiveness, Performance, Data Protection, Autonomy (of users), and Overall (generic comments). These pillars were derived and refined through an iterative 6-phase process:
- A bottom-up approach, where individual contact tracing applications were evaluated for derivation of important app characteristics;
- A parallel academic/grey (media) literature review of app/health-app evaluation papers, to the same end;
- Cluster analysis, to create an amalgamated framework that revealed distinct super-categories (the pillars). This was refined via team review for redundancy and sufficiency;
- A Devil's Advocacy phase where individual 'pillar owners' were challenged by another member of the team to assess the characteristics in that pillar for sufficiency, relocation, and relevance. This was followed by a full team critique of the pillars against the same criteria;
- Application of the resultant framework against the HSE app, to evaluate the framework further, leading to refinement of the pillars and characteristics.

Research Questions
In order to inform the evolution of the HSE Covid Tracker app going forward the following two research questions were formulated:
An ancillary analysis also probes the commonalities and differences between Apple and Android users to assess if there are any platform-specific issues that arose and to see how common the profiles are across the two sets of users.

Structure
In the next section, we discuss the method followed for data gathering and analysis, and then we present our results. Finally, the discussion section focuses on our findings, and potential recommendations for improving the efficiency of the app towards limiting the Covid-19 pandemic.

Methodology
In order to ensure that the data for the analysis was ecologically valid, it was decided to focus on naturally occurring reviews only. There is no discussion/issue forum on the HSE Contact Tracker's GitHub page [12] but user reviews are available on the Google Play store and the iTunes App store and these reviews have been used by other researchers in similar analyses [13] [14] [15]. Hence, a Python program was developed to scrape reviews of the HSE's Covid Tracker app from both the Google Play store and the iTunes App Store. This script was executed on the 13th of August 2020 and scraped all reviews up to that date. It should be noted that, at this point in time, all the reviews were for version 1.0.0 of the app.
The reviews, thus scraped, were populated into a .csv sheet, which was further converted into .xlsx format for ease of analysis. This spreadsheet is made available for interested readers at [16]. The extracted file was cleaned of duplicates, sorted alphabetically and contained the following fields:
- The associated textual review.
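The cleaning step described above (removing duplicates and sorting alphabetically) could look something like the following. This is a minimal sketch using only the standard library: the column name `review`, the rule that two rows are duplicates when their review text is identical, and the sample rows are all assumptions made for illustration, not details of the authors' script.

```python
import csv
import io


def clean_reviews(rows, text_field="review"):
    """Drop rows with duplicate or empty review text, then sort A to Z.

    Duplicates are detected on the exact (whitespace-trimmed) review text,
    which is an assumption made for this sketch.
    """
    seen = set()
    unique = []
    for row in rows:
        key = row[text_field].strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(row)
    # Case-insensitive alphabetical order on the review text.
    return sorted(unique, key=lambda r: r[text_field].strip().lower())


# Illustrative input standing in for the scraped .csv file.
raw_csv = "review\nGreat idea\nBattery drain since update\nGreat idea\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))
cleaned = clean_reviews(rows)

for row in cleaned:
    print(row["review"])
```

Reading from an in-memory string keeps the example self-contained; with a real file, the `io.StringIO` object would simply be replaced by an opened file handle.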

Analysis
Manual sentiment analysis was subsequently performed against the 7-pillar evaluation framework described above. These pillars are referred to in the article by their acronyms: General Characteristics (GC), Overall comment (O), Functional Effectiveness (FE), Usability (U), Data Protection (DP), User Autonomy (A), Transparency (T), and app Performance (P). The sentiment analysis was performed manually because, even though there have been huge improvements to automated sentiment analysis in recent years [17] [18], the precision and recall rates achieved are still not perfect [17] [18]. This would be exacerbated in this instance because here 'negative sentiment' aims to capture not just reviews with a negative tone but also quite positive reviews that request a specific refinement or modification.
Each review was randomly allocated to one of four reviewers, the overall allocations to each reviewer being of equal size. These reviewers were tasked with independently segmenting each review into a set of positive and negative user observations, and classifying each of these observations into their appropriate pillar: essentially a form of content analysis that allows "for the objective, systematic and quantitative description of the manifest content of communication" [19] [20]. Here, as mentioned above, 'negative user observations' refer to both comments with negative sentiments and comments suggesting refinements.
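One simple way to obtain a random allocation with equal-sized shares is to shuffle the reviews and then deal them out in turn. The sketch below illustrates that idea; the function name, reviewer labels and seed are hypothetical, and the text does not state how the authors actually performed the allocation.

```python
import random


def allocate_reviews(review_ids, reviewers, seed=None):
    """Shuffle the reviews, then deal them round-robin to the reviewers.

    Share sizes differ by at most one, and are exactly equal when the
    number of reviews is a multiple of the number of reviewers.
    """
    rng = random.Random(seed)  # a fixed seed makes the allocation repeatable
    shuffled = list(review_ids)
    rng.shuffle(shuffled)
    allocation = {name: [] for name in reviewers}
    for index, review_id in enumerate(shuffled):
        allocation[reviewers[index % len(reviewers)]].append(review_id)
    return allocation


# Illustrative run: 20 review IDs shared among four hypothetical reviewers.
allocation = allocate_reviews(range(1, 21), ["R1", "R2", "R3", "R4"], seed=0)
for reviewer, assigned in allocation.items():
    print(reviewer, len(assigned))
```

Shuffling first and dealing second separates the randomness from the balancing, so the equal-size property holds regardless of the seed.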
A joint discussion session at the start of the analysis ensured that all reviewers had a common understanding of the seven pillars and the sentiments being sought. The subsequent analysis resulted in three new fields being incorporated into the spreadsheet:
- The text segment where a positive, neutral or negative sentiment was detected;
- The sentiment (positive/negative/neutral) associated with that text segment;
- The associated pillar.
After the coders had individually coded the reviews in this fashion, one author was charged with assessing the entire coding for interpreter drift and sentiment inconsistencies. Interpreter drift is where a coder's coding drifts over time [21]. For example, in a coder's initial coding, they may classify a review segment complaining of "the lack of more detailed feedback on the location of cases" as a 'usability' issue. But, after fatigue has set in, they may note it as a 'Performance' issue. In such cases, the author charged with assessing interpreter drift corrected the drift by re-categorizing the latter comments in line with the categorization of the original comments (in the above example the lack-of-feedback comment would be consistently referred to as 'usability').
In terms of sentiment inconsistencies, there were 14 occasions where a reviewer very obviously ticked the incorrect sentiment. In one case, for example, a user complained of battery drain and the coder incorrectly categorized the sentiment as positive. These clear-cut mistakes were also rectified by the author charged with assessing the coding.
In order to assess inter-coder reliability, approximately one seventh of the reviews were coded by more than one reviewer. Figure 2 presents a snippet of the coding spreadsheet, illustrating several user-comments where more than one reviewer coded the reviews.
Inter-coder reliability could then be assessed using these reviews and Cohen's Kappa [22]. The protocol was as follows: The text segments both coders identified for each review were re-ordered so that, where possible, they were in the same order across coders. Then the pillars and sentiment for each common segment were compared. If they were the same for the same segment, the coders were considered 'in agreement'. If either the pillar or sentiment were not the same, the coders were not considered in agreement. Likewise, if one of the coders noted a segment that the other did not, then this was considered another disagreement between the coders [23].
Subsequently the individual pillars, sentiment and segments of the entire data-set, as coded by the main coder (not the coders who coded reviews solely for the purposes of the Kappa analysis), were analysed for trends and themes across the reviews: Pillars and sentiment were assessed quantitatively to identify the prevalent types of concerns expressed in the data set. Segments and sentiments were analysed to identify the individual concerns expressed by the app reviewers. The results of this analysis are now presented.
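To make the agreement protocol concrete, the sketch below computes Cohen's Kappa over aligned segments. It rests on assumptions not stated in the text: each segment's label is taken to be the (pillar, sentiment) pair, and a segment noted by only one coder is given a placeholder label for the other coder so that it counts as a disagreement. The example labels are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two coders labelling the same aligned segments."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both coders need the same, non-zero number of labels")
    n = len(labels_a)
    # Observed agreement: fraction of segments with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Placeholder for a segment that one coder noted and the other did not
# (an assumption of this sketch, so that it counts as a disagreement).
MISSING = ("none", "none")

coder_1 = [("U", "negative"), ("P", "negative"), ("DP", "positive"), ("O", "positive")]
coder_2 = [("U", "negative"), ("U", "negative"), ("DP", "positive"), MISSING]

kappa = cohen_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.3f}")
```

Treating the pillar and sentiment as a single combined label mirrors the rule that coders agree only when both match; scoring the two dimensions separately would be an equally reasonable alternative.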

Results And Findings
In total 1287 comments were coded, 998 from Android users and 289 from Apple users. Table 1 presents the total number of identified comments per pillar and those totals broken down by positive/negative sentiment. In terms of users' overall perception of the app, the data suggests that they were largely supportive, commenting mainly on the implementation of the app and, to a lesser degree, on the concept of the app ('its a no brainer', 'great idea'). The ratio of implementation:concept comments was approximately 5:1 in the reviews.
As Table 1 shows, most of the 'negative' comments were aimed at performance and usability. The prevalence of performance comments can largely be explained by an Android battery-drain issue that arose on August 8th, following a Google update. Essentially all of the negative performance comments were due to this Google update. If this issue were excluded from the analysis, the performance comments would have been entirely positive. However, this seems to have been a very serious issue for Android phone users: of the 365 reviews associated with these comments, 102 mentioned the word 'uninstall', suggesting that well over a quarter of those who complained about the issue considered (at least) uninstalling the app.
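A keyword count of this kind can be reproduced with a case-insensitive substring match. The sketch below is illustrative only: the function name and sample reviews are invented, and matching on the substring 'uninstall' (which also catches 'uninstalled' and 'uninstalling') is an assumption about how the count was made.

```python
def share_mentioning(reviews, keyword):
    """Return (count, share) of reviews containing the keyword, ignoring case."""
    if not reviews:
        return 0, 0.0
    needle = keyword.lower()
    count = sum(1 for text in reviews if needle in text.lower())
    return count, count / len(reviews)


# Invented sample reviews, standing in for the battery-drain reviews.
sample_reviews = [
    "Battery drain is terrible, had to uninstall.",
    "Phone overheating since the update.",
    "Uninstalled until this is fixed.",
    "Draining my battery in hours.",
]
count, share = share_mentioning(sample_reviews, "uninstall")
print(f"{count} of {len(sample_reviews)} sample reviews ({share:.0%})")

# The proportion reported in the text: 102 of 365 reviews.
reported_share = 102 / 365
print(f"Reported share: {reported_share:.1%}")
```

The reported figures work out at just under 28 percent, which is the basis for the 'well over a quarter' reading.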
Interesting also are the data protection and transparency pillars. Users seem to perceive that the HSE has done well on both fronts with 32 of 44 data-protection comments having a positive sentiment and 14 of 23 transparency comments having a positive sentiment. This sentiment trend was consistent across both Apple and Android devices.

Usability Comments
The prevalence of usability issues across the reviews is to be expected given the forums involved: Users are most typically interested in usability [24]. Their usability feedback is detailed in Figure 4.
The main usability issue, as determined by the number of user suggestions (90), was around the feedback provided by the app regarding the occurrence of Covid across the country. Version one of the app focused on the number of cases in total at national and county level. Users felt that feedback on Covid cases should focus on a more recent time range (36), on active or newly-found cases only (24), and on finer granularity in terms of geographical location (21). Of this latter category, seven reviewers suggested that hot-spots be identified, but this would be difficult in terms of maintaining the current privacy standards the app embodies.
An interesting idea that arose in two reviews was that the app should also report on the number of close contacts users had per day, where the user could get daily feedback and thus try to minimize that 'score' over time. This is analogous to the gamification concept of 'streaks' [25] where a user might aim to keep their number of close contacts below a certain daily threshold, over time, and thus continue their 'streak'.
Another prevalent theme in the usability pillar was a desire for notifications for daily check-ins, where the user self-reports their health status to the app, ideally on a daily basis. 70 comments requested this enhancement or expressed dissatisfaction at it not currently being available in the app.
A surprising finding was that 20 reviewers complained that a town or area was not available for selection when they were profiling themselves during app set-up. Often this was their own town/area, but in five cases reference was made to the exclusion of the six counties in Northern Ireland. Another reviewer noted that it would be interesting to have the possibility of recording two areas, where a user works in one location but lives in another.
Finally, iPhone users complained, in significant numbers, about the app's inability to work on older versions of iOS or older iPhones (the iPhone 6 particularly). Overall, 54 of the 289 Apple comments were targeted at this issue: by far the most prevalent focus of iPhone users' concerns. This is unfortunate because it represents a significant number of potential users who want to install the app but cannot. In addition, as in the case of the battery-drain issue, this is entirely outside of the HSE's/NearForm's control: for this to be addressed, Apple would need to incorporate backward compatibility into the associated operating systems. The last column of Table 1 shows the 'negative' sentiment corrected for these issues that are outside the control of the HSE/NearForm.

Discussion
In general, public sentiment seems positive for the HSE's Contact Tracking app. The overall comments are largely supportive and, on the critical aspects of data protection and transparency, public opinion seems favorable, as assessed by the Google Play store and iTunes App Store reviews.
Below that positive sentiment though there are some prevalent user requests/concerns that should be addressed. However, the most prevalent platform-specific concerns (and indeed two of the three most prevalent concerns across platforms) are outside of the developers' control. The battery drain issue was caused by an update to an underpinning Google library, and was remedied quickly, although not before substantial reputational damage had been done in terms of the public's perception of the app. Likewise, the backward-compatibility issue for older versions of iOS is a significant issue for potential users and can only be addressed by Apple.
Something that the HSE/Nearform have already worked on, in their new versions of the app, is the information the app gives about the spread of the disease. They have tackled the desire for more timely information with information now provided on new cases within the last day, two weeks, and two months at a national level.
They have also expanded the information available at county level, where again users can see the new cases for the last day, two weeks and two months.
But they have not increased the granularity of information in terms of geographical location: That information is still presented in terms of the national picture and in terms of counties. One reviewer suggested that the facility offered by the Central Statistics Office of Ireland, where Covid cases are reported by electoral division (https://census.cso.ie/covid19/), should be incorporated into the app. But the statistics presented on this website are for the 20th of June, and so are now out of date and could not be used.
Another aspect prevalent in the reviews, and one that might facilitate increased user engagement, was the use of daily notification alerts to remind users to check-in with their health status. This, combined with the improved disease-spread feedback provided by the app, and maybe gamification of aspects like recording daily close contacts, may encourage users to retain the app and engage with it for longer time periods [25].

Limitations and Threats to Validity
In terms of reliability, the analysis was done by different coders, so there may be a bias or an inconsistency in that coding. To mitigate against this possibility the four coders were all given an introductory session where 15 illustrative examples were coded and discussion undertaken to form a common understanding. The inter-coder reliability assessment undertaken in this study suggests that this introductory session largely achieved its goal, as discussed in section 2, and that the reliability of the analysis was good.
A construct validity issue [26] is that the data obtained may not be totally correct: User opinions may be informed by hearsay and users are not always in an ideal position to report on quality aspects like data protection, or performance [27]. To mitigate against this the analysis focused on sentiments that were more pervasive across the data-set and brought to bear considerable knowledge of the app itself, the researchers having studied it as part of the project goal to derive the pillars beforehand.
An external validity issue [26] is that the dataset obtained is only a very small sample of the user population. Even considering that a large number of users may have uninstalled the app after the Android battery issue, and that the currently quoted download figure of 1.97 million is thus somewhat inflated, it is likely that there are well over 1.3 million active users (this is quoted in the app's User Interface). While a sample of 1287 comments is small by comparison, it can be considered representative [28]. However, a more serious, related external validity issue is that the sampling protocol in this case is self-selecting, where only vocal reviewers (those that left reviews) are included in the data-set.
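The claim that a sample of this size can be representative can be given a rough quantitative reading through the standard margin-of-error formula for a proportion, with a finite population correction. The sketch below is illustrative only and rests on several assumptions: a 95 percent confidence level, the most conservative proportion of 0.5, a population of 1.3 million active users, each comment treated as one independent respondent, and simple random sampling. As noted above, the last of these does not hold for self-selected reviews, so the figure is at best an optimistic bound. It is not necessarily the calculation behind [28].

```python
import math


def margin_of_error(sample_size, population_size, proportion=0.5, z=1.96):
    """Margin of error for a proportion, with finite population correction.

    Assumes simple random sampling; z=1.96 corresponds to a 95 percent
    confidence level and proportion=0.5 is the most conservative choice.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    correction = math.sqrt(
        (population_size - sample_size) / (population_size - 1)
    )
    return z * standard_error * correction


# 1287 comments; 1.3 million active users is assumed as the population size.
moe = margin_of_error(sample_size=1287, population_size=1_300_000)
print(f"Margin of error: +/-{moe:.1%}")
```

Under these assumptions the margin of error is a little under three percentage points, and the correction term is close to 1, showing that the population size has almost no effect once the population is this much larger than the sample.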

Conclusion
This report has focused on sentiment analysis of all the reviews that were available on Google Play and Apple collectively before August 14th 2020, towards helping evolve a better contact tracing application to fight the coronavirus. The results suggest that the app is well perceived and seen as sensitive to data-protection concerns. But the reviews also suggest that the feedback it provides regarding occurrence of Covid-19 should be made more granular in terms of location, and that the option of notifications should be included, to remind users to lodge their health status every day. Additionally, users seemed upset when their location was not available in the app. Finally, efforts should be directed at prompting Apple to make their Exposure Notification Service available to older iPhones and older versions of iOS.
But some of the suggestions made by users have already been addressed in the next version of the Covid Tracker that was launched by the HSE. Specifically, feedback on the status of Covid-19 has been made more current in terms of time-span reported and current cases. This consilience, between our results and HSE updates, not only strengthens confidence in our other findings, but also suggests that the health authorities in Ireland are evolving the app in a direction sympathetic to user concerns.
While the analysis proved to be helpful in understanding the sentiment behind the public's opinion regarding the HSE's Covid Tracker, an automated analysis of users' sentiments using artificial intelligence could also be developed [29]. This would facilitate understanding wider public opinion about the app over larger data-sets, in a much more time-sensitive manner.
Our future work will be in this direction and already we have trialled an initial approach. The existing reviews, scraped and extracted into a .csv file, have been cleaned of special characters and unnecessary symbols using Alteryx. Sentiment analysis has been done using R Studio, as shown in Fig 5. In this context, sentiment analysis is the computational task of automatically determining what feelings a writer is expressing in text [30]. Sentiment is often framed as a binary distinction (positive vs. negative), but it can also be more fine-grained, like identifying the specific emotion an author is expressing (like fear, joy or anger) [18].
However, we are conscious that sentiment analysis alone may only help us focus our analysis on dissatisfied users/comments [31]. It will not identify the pillar of interest, identify the users' specific issues or, indeed, the prevalence of those issues across the data-sets. So additional analysis may be required to facilitate derivation of the users' issues [32].

COMPETING INTERESTS
The authors declare no competing interests.