Governance of data and ethical risks identified
All systems that involve the collection, curation and analysis of data imply engaging with data governance that encompasses a multitude of social systems (19). The form of data governance privileged by the MOOD consortium promoted a data-sharing approach, but in practice data sharing proved less straightforward than expected because it implied adherence to a set of rules and compliance with the legal frameworks of the EU and of non-EU countries. For example, data access to the EFSA platform required an access agreement and a reuse agreement obtained through motivated requests. The network, nodes, data flow and connectivity also reveal a crucial challenge posed by big data analytics: the interoperability of systems, which raised several issues, such as standardisation across a complex infrastructure.
The mapping shows that several types of ethical risks can appear at different levels of the path taken by the data.
The issue of privacy is a main challenge, as stated by an ethics consultant in an interview and confirmed by the literature. Owing to the cross-referencing of data, re-identification is always possible (20). Several public health scholars have shown how data brokers commodify personal data for profit (21). As one ethics expert notes, merging data collected by researchers with other existing data can lead to unexpected problems or unwanted harm.
Machine learning and its algorithms are important for prediction and real-time analysis. Nevertheless, they can reproduce stereotypes present in a set of non-official sources and lead to discrimination or stigmatisation of groups based on their race or gender. The COVID-19 pandemic generated information conveying negative views of Chinese ethnic groups, as Wuhan, situated in central China, was the first area where early cases were reported, and China was considered the hot zone where the virus and deaths originated. Owing to the selective nature of data and data-driven techniques, an increasing number of studies have shown that some correlations can serve as proxies for unknown or protected categories that were deliberately left unrecorded, a mechanism that can lead to gender inequities and to public health services being provided to the detriment of one group (22). Some disease intelligence tools designed to extract epidemiological events from media news for syndromic surveillance and disease profiling often ignore gender-related trends. Scholars have long investigated how ignoring such differences amplifies discrimination (22).
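To make the proxy mechanism concrete, the following minimal sketch (entirely synthetic and hypothetical, not drawn from the MOOD tools; all variable names are illustrative) shows how a model that is never given a protected attribute can still reproduce a disparity through a correlated feature.

```python
# Illustrative sketch (not from the MOOD project): even when a protected
# attribute such as gender is deliberately left out of the training data,
# a correlated feature can act as a proxy and reproduce the disparity.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical synthetic population: 'gender' is never shown to the model.
gender = rng.integers(0, 2, n)                       # 0 or 1, deliberately unrecorded
occupation_code = gender + rng.normal(0, 0.3, n)     # feature strongly correlated with gender
symptom_score = rng.normal(0, 1, n)

# Outcome (e.g. "flagged for follow-up") historically depends on gender.
flagged = (0.8 * gender + 0.5 * symptom_score + rng.normal(0, 0.5, n)) > 0.8

X = np.column_stack([occupation_code, symptom_score])  # gender column excluded
model = LogisticRegression().fit(X, flagged)

pred = model.predict(X)
print("flag rate, group 0:", pred[gender == 0].mean())
print("flag rate, group 1:", pred[gender == 1].mean())
# The two rates differ markedly: the proxy feature lets the model
# recover the protected category it was never given.
```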
The black-box approach, which produces opaque algorithms, limits the fundamental freedom and reliability of AI systems and their alignment with the FAIR principles that are critical for decision-makers; the lack of transparency could affect decisions based on the outputs of AI systems. AI transparency and explainability are among the core ethical principles of the United Nations (23).
Data from social media sources are anonymised. Moreover, information deleted by users from online social media sources remains in the researcher's database (e.g., a tool that extracts unofficial data for syndromic surveillance), which can render datasets obsolete and poses a critical issue, pitting the right to be forgotten against the actual means to exercise it. The GDPR obliges the legal person responsible for data processing to stop processing data that are inaccurate or incomplete; moreover, according to Twitter's terms and conditions, a compliant API user should remove deleted content from any database built via Twitter's API (25). This is not easy to do when the process is ongoing. In addition, Twitter data access is provided through an API that makes the data publicly available to anyone. An important ethical issue remains when informed consent is not requested from the digital publics whose posts feed public health surveillance and research outputs. Interviews also confirmed possible sampling biases affecting representativeness: Twitter users are not representative of the general population, and the keyword searches used by the algorithm introduce further selection. Tweets can also be misinterpreted when the context of the identified keywords is not considered, as a project partner mentioned during a discussion: an ironic tweet stating a preference for "an Ebola pandemic rather than losing soccer games" was wrongly collected by an algorithm. Similarly, the societal context, such as events or restrictions, can influence content and trends. These observations confirm the importance of semi-automation and of human supervision during machine learning and text extraction, especially during a pandemic, when European disease surveillance agencies lack human resources (24). Unfortunately, Twitter data, first used as an experimental ground, are no longer used in PADI-Web but remain a valuable source of data for ProMED. The lack of human resources to continue monitoring disease trends, together with strict ethical risks (privacy concerns), may lead European researchers and data scientists to reconsider mining social media sources.
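As a concrete illustration of the removal obligation discussed above, the sketch below shows one possible way to periodically purge deleted tweets from a local research database using Twitter's v2 tweet-lookup endpoint. It is a simplified, hedged example (the table layout, the credential placeholder and the assumption that an ID absent from the response should be purged are ours), not the PADI-Web or ProMED implementation.

```python
# Minimal sketch (not the project's actual code): periodically re-check stored
# tweet IDs against Twitter's v2 lookup endpoint and delete from the local
# database any tweet that is no longer returned (deleted, protected or removed).
import sqlite3
import requests

LOOKUP_URL = "https://api.twitter.com/2/tweets"   # v2 tweet lookup (up to 100 ids per call)
BEARER_TOKEN = "YOUR_BEARER_TOKEN"                # placeholder credential

def purge_deleted_tweets(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    # Assumed local schema: a 'tweets' table with a 'tweet_id' column.
    ids = [row[0] for row in conn.execute("SELECT tweet_id FROM tweets")]

    for start in range(0, len(ids), 100):          # the endpoint accepts at most 100 ids per request
        batch = ids[start:start + 100]
        resp = requests.get(
            LOOKUP_URL,
            params={"ids": ",".join(str(i) for i in batch)},
            headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
            timeout=30,
        )
        resp.raise_for_status()
        still_available = {t["id"] for t in resp.json().get("data", [])}

        for tweet_id in batch:
            if str(tweet_id) not in still_available:   # deleted or withdrawn by the user
                conn.execute("DELETE FROM tweets WHERE tweet_id = ?", (tweet_id,))

    conn.commit()
    conn.close()
```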
The Euro-American tradition of research driven by industry, with enormous data centres and infrastructure located in places such as Oxford, California, or Boston, is internationally well known. This approach allows the imposition of criteria treated as universal, such as the use of English for the majority of algorithm keywords, to the detriment of vernacular languages (Eastern European or African languages). The language orientation of surveillance systems tells us that data are highly driven by politics (15). The vernacular indifference of machine learning may amplify discrimination.
During informal discussions with experts, the expression "radioactive data" appeared several times and was immediately related to personal data such as GPS traces; some confusion was voiced among researchers who are not fully aware of what personal data truly encompass. Personal data are not dangerous by themselves; it is the hands that manipulate them, and the objectives and interests behind their use, whether commercial, political, or repressive (for example, a police investigation), that can affect people, who could for instance lose their freedom or suffer financial losses. Given that the use of big data and AI can be a "double-edged sword" (26), researchers should think critically about the larger societal consequences of their interventions. Building ethical sociotechnical systems requires shifting AI ecosystems toward equitable and accountable AI. This became a major objective for the European Commission; researchers were required to seek the consent of users, avoid privacy intrusion, and apply data minimisation (17, 27). Researchers' knowledge of these risks is crucial.
Knowledge and role of researchers
The approach to ethics expertise, which contrasts comprehensive moral guidance with the enforcement of legal compliance (depending on the expert's background), a kind of "soft" versus "hard" ethics (28), has consequences for some researchers' readiness to cooperate in thinking about and tackling the ethical issues raised by their tools and sociotechnical innovations.
The answers collected throughout the survey show that the respondents' level of ethical knowledge varies according to their field of expertise, whether they are researchers or data scientists. The respondents evaluated their knowledge of anonymisation as good (Fig. 2), which contrasts somewhat with how they actually received and anonymised the data. Only a few parameters, although ethically important, are considered when adapting the aggregation of the data (Figs. 4A and 4B). Additionally, we have seen above that not all respondents have read or are aware of the GDPR. As partners outside the European Union may not have the same data protection legislation, it is important in the MOOD project to establish common standards for working internationally; nevertheless, partners in Eastern Europe are invited to follow their own regulations.
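As an illustration of what adapting the aggregation of data can mean in practice, the following minimal sketch (with assumed column names, not taken from the MOOD pipeline) coarsens case-level records to region-week counts and suppresses small cells before sharing.

```python
# Illustrative sketch (assumed schema, not the MOOD pipeline): coarsening
# case records before sharing, so that no exact location or date of an
# individual case is exposed -- one common way to adapt the aggregation.
import pandas as pd

def aggregate_cases(cases: pd.DataFrame, min_count: int = 5) -> pd.DataFrame:
    """Aggregate case-level records to admin-region x week counts and
    suppress cells below a minimum count (small-cell suppression)."""
    df = cases.copy()
    df["week"] = pd.to_datetime(df["report_date"]).dt.to_period("W")  # drop exact dates
    counts = (
        df.groupby(["admin_region", "week"])      # drop exact coordinates
          .size()
          .reset_index(name="n_cases")
    )
    # Suppress rare cells that could single out individuals.
    return counts[counts["n_cases"] >= min_count]
```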
Conditions of access to the MOOD platform may cause a "digital divide", mainly for partners in the southern region, because special conditions (skills, strategies, instrument information and good internet connections) are required despite the "open source" perspective adopted by the project, which makes models and code available.
Big data are inherently voluminous; the storage that keeps them available must be able to keep pace. As seen during interviews and on the MOOD data flow map, all project data are hosted on the servers of AWS (Amazon Web Services), an American company. One scientist noted that the physical location of the server is not always known. The "CLOUD Act" (Clarifying Lawful Overseas Use of Data Act), a 2018 U.S. federal law, "expands the geographical scope of possible requests by the U.S. government that can access data on servers, regardless of their location". This law tends to go against the GDPR, even though an international agreement remains mandatory "for a jurisdiction or an authority resulting from an administration to transfer or disclose personal data".
In the event of disagreement, a U.S. court may issue a warrant if it is convinced that the public interest is threatened. In addition, European legislation has several requirements related to data storage. The cloud service offering must be public and multi-provider, hybrid, and energy-efficient, which means lowering the carbon footprint (29).
One of the respondents considered environmental data to be without ethical risks and, therefore, without the need for aggregation (see Fig. 4A). Reflecting on the future of a platform integrating these data, we may ask whether treating certain environmental parameters as disease emergence factors could harm the biodiversity of a place. Indeed, if one wishes to reduce the risk of emergence in the future, this could imply lowering or eliminating certain environmental factors by reducing or shaping part of the biodiversity. From a "One Health" perspective, such measures may have an impact on the environment or on animals that deserves to be carefully considered by researchers and decision-makers.
Ethics support
The current advice provided by ethics advisors does not seem sufficient for several researchers. This can also be seen in the evaluation of the ethical advice in the questionnaire and in the average scores attributed to each criterion. Figure 6 shows the types of assistance expected in addition to an ethical code of conduct as support. However, 56% of the respondents (9/16) considered that they did not need an ethical code of conduct to perform their job, or would need it only in another form.
Study limitations
Several internal documents used to create the data flow map have been updated since then, making some information obsolete as the project changes over time. Therefore, the actor-network is not exhaustive because it was constructed at only one point in time, and this should be considered when interpreting the results.
Regarding the interviews, the timeline for implementation and data analysis was short. Despite the target population (MOOD partners, N = 132), the sample was small (16/132): we obtained few answers, and people had only 22 days to respond; therefore, the results are not statistically significant. Nevertheless, implementing mixed methods helped to balance this issue, and the methodological approach was intrinsically inductive.
For a more balanced and richer overview of ethical concerns in the governance of data and the use of algorithms in digital health surveillance, it would have been important to consider the views of end users of the MOOD prototype platform, who had not yet been identified when our study started. Ethical issues related to the use of the prototype platform may be assessed in future evaluations of the programme, which must incorporate the upcoming EU AI Act and the introduction of a paywall for Twitter APIs, which will considerably change the political economy of open source in academic research. Ethical support also needs to be more technically oriented, providing concrete answers to tangible problems.