REDbox: a comprehensive semantic framework for data collection, management and sharing in tuberculosis research

Background: The outcomes of clinical research depend directly on the correct definition of the research protocol, the data collection strategy and the data management plan. Furthermore, researchers often need to work within challenging contexts, such as in Tuberculosis services, where human and technological resources for research may be scarce. The use of Electronic Data Capture systems can help to mitigate such risks and to enable a democratic environment in which to conduct health research and promote results dissemination and data reusability. Methods: The proposed solution was based on needs pinpointed by researchers, considering the lack of a comprehensive solution for conducting research in low-resource environments. REDCap was used for storing and managing research data. KoBoToolbox enables form building and online and offline data collection. Semantic annotation is applied to promote data integration and availability. Results: The REDbox framework was built to enhance data collection, management and sharing in tuberculosis research, while providing a better user experience. Metadata was defined to enable the integration of both systems. A converter module enables compatibility of forms in both systems. Data collected in KoBoToolbox forms are instantly submitted to an ETL processor, which extracts and transforms the data to be loaded into REDCap. A data quality module facilitates the management of data by reducing the workload of time-consuming and delicate tasks. A service provides practical tools to enhance the use of ontologies and support the continuous integration of different data sources. Conclusions: The relevance of this article lies in the innovative approach to supporting TB research during the collection, management and dissemination phases, which are often carried out in contexts with few human and technological resources. REDCap presents a better approach to the whole research life cycle, but has some usability concerns.
On the other hand, KoBoToolbox natively works online or offline, without any additional software. Therefore, by focusing on the positive aspects of each tool, it is possible to underpin tuberculosis research by improving data collection, management capability and security. Furthermore, the aggregation of raw the of research


Background
Data collection is one of the most crucial moments in all types of research projects, and it can drive a project to success or failure. Poor data quality is sometimes noticed only when the collection phase is over or almost over. To avoid this, in addition to a trained data collector [1], the use of a reliable data capture system is essential.
Additionally, the success of clinical research depends directly on the correct definition of the research protocol, the data collection strategy and the data management plan [2]. These elements drive the quality and reliability of the collected data that will be used to analyze the outcomes of a given study. The adoption of new methods, tools and sources of data has changed the way research is conducted.
However, new challenges have arisen, demanding innovative approaches to collect, manage, and publish data. Well-managed data are easier to use and analyze towards the confirmation of a research hypothesis. The reuse of data in further studies is also enhanced. In other words, good data management stimulates more collaboration between researchers and maximizes the investment of funders [3].
The use of an Electronic Data Capture (EDC) system can mitigate the risk of storing potentially sensitive data on paper and help to ensure compliance with medical data privacy, security, and regulations, while improving data quality and management capability and reducing time and costs [4]. An EDC system should be capable of working independently of the operating system or proprietary protocols, and should be interoperable, i.e., able to communicate with other systems in a transparent and consistent way [5].
Moreover, in health research, researchers need to work within different contexts, from facilities with high-end devices available to ones with low availability of resources, such as a poor or nonexistent internet connection or even no reliable electrical power. In the case of Tuberculosis (TB), an infectious and neglected disease [6], resources for research may be scarce and the costs of using an EDC system could be a limitation. These aspects stand out as barriers to collecting data in TB research and, therefore, making data available for further data-driven studies is crucial to underpin the development of new evidence-based decision-making tools.
Integrating information in larger systems is hampered by the heterogeneity of data formats and data structures. Data must be correctly described to be useful [7]. Thus, semantic interoperability is a key consideration in information systems design [8]. It is achieved when one system can understand the context and meaning of information provided by another system [9].
Meaning can be added to data by using ontologies or other semantic standards, i.e., well-defined vocabularies which allow precise and machine-readable description of knowledge about a certain domain [10]. Ontologies are important in semantic alignment for data integration, information exchange, and semantic interoperability [11]. An ontology is composed of several properties and each one describes a specific piece of data in the domain being represented [12].
Besides ontologies, simple standards such as the Humanitarian Exchange Language (HXL) help to speed up data processing and create interoperability across data sources. HXL is a project by the United Nations Office for the Coordination of Humanitarian Affairs for the coordination of disaster response with semantic web technologies. It uses simple markup through hashtags and its goal is to contribute to the automation of processes to improve the information flow for decision makers [13].
In the case of health research, semantic annotation can help describe the data that is being collected. It can be useful to later extract and link different research datasets described by the same vocabulary.
Usually, each study has several collection instruments, totaling hundreds of fields to be filled in as the research progresses. Manual annotation is always an option, but automated approaches to semantic annotation are extremely important [14].

Objectives
Clinical trials and studies have increasingly started using EDC systems to conduct a range of analyses [15]. In this sense, and considering that Brazil is one of the top 30 high TB burden countries [16], this work aims to present REDbox, a comprehensive framework based on the REDCap [17] and KoBoToolbox [18] systems to enhance research data collection and management in low-resource fields, while providing a better user experience.
Additionally, REDbox is intended to promote semantic interoperability of TB research data. Therefore, relying on ontologies and HXL to perform semantic annotations, the objective is to automate the design of an instrument based on a given ontology and the generation of ontologies derived from instruments' schemas, as well as to increase the availability of data for further data-driven TB research.

Methods
The scientific methodology underlying this work is Action Research. It simultaneously assists in practical problem-solving and expands scientific knowledge, as well as enhancing the competencies of the respective actors [19].
The lack of a comprehensive solution for conducting research in low-resource environments, such as TB reference services in Brazil, led to the conceptualization of the approach proposed in this work. No such solution was found in the literature. After rounds of discussions with researchers, the main challenge was defined as overcoming existing technological barriers in TB services, such as poor internet connection, unavailability of devices, and staff with little training in digital tools. For a validation phase, REDbox is currently in use in five cross-institutional TB research projects in Brazil. This work also demonstrates how the use of semantics can promote the reusability and interoperability of research data. Therefore, the following research questions were defined: "Is it possible to deliver a tool for research data collection and management to be used in low resources environments, such as in tuberculosis services?" and "How to promote data interoperability to increase availability of TB data for researchers?"
The solution is relevant because it may:
1. Improve the collection and analysis of research data during the whole study period;
2. Facilitate the management of research events and data;
3. Improve the user experience by combining positive aspects of existing solutions;
4. Increase the security of research data;
5. Remove technological barriers by delivering an approach that works on any device and without an internet connection;
6. Remove cultural barriers, such as researchers' lack of confidence in dropping paper-based methods;
7. Promote semantic interoperability of collected data for data reuse and record linkage.

REDCap
REDCap is a web-based, metadata-driven software built in 2004 by a team at Vanderbilt University to enable classical and translational clinical research, basic science research and general surveys, providing researchers with a tool for the design and development of electronic data capture tools [17] [20]. REDCap is free software, but it is not considered open-source: a license is required. It can be installed and managed by a small IT team [21].

KoBoToolbox
Developed by the Harvard Humanitarian Initiative, KoBoToolbox is a free, open-source suite of tools for data collection and basic analysis. It was initially built for use in challenging environments in developing countries [18]. Although it presents more basic functionalities, the software delivers modern styles and allows users to work offline.

KoBoToolbox is powered by the Enketo open-source project [22] and makes forms available online and offline in any modern browser, thanks to HTML5 features. The software relies on the XLSForm standard, which simplifies the authoring of forms in spreadsheets in a human-readable format [23]. A visual and intuitive form builder is available, or forms can be imported as XLS files.

Semantic annotation
To better represent collected data, fields in research forms can be annotated with semantic vocabularies. REDCap offers the possibility to include annotations for each field, which will not be displayed on the form or survey, but will be available to the designer and in data exports to help understand the data [20]. This annotation can be a property of an ontology or an HXL hashtag, depending on the user's preference.
KoBoToolbox natively supports the use of HXL. When authoring an XLSForm, the user simply inserts one extra column in the spreadsheet and fills it with HXL hashtags identifying the type of information collected by each field. The form builder also provides an intuitive way to relate a hashtag to an instrument's field.
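As an illustration of this annotation step, the following minimal Python sketch builds the survey sheet of an XLSForm with one extra annotation column. The column name `hxl`, the field names and the hashtags are assumptions made for the example; they are not taken from the REDbox instruments and were not checked against the HXL dictionary.

```python
import csv
import io

# Hypothetical survey rows. The extra "hxl" column (assumed name) holds one
# illustrative hashtag per field; unannotated fields leave it empty.
SURVEY_ROWS = [
    {"type": "integer", "name": "age", "label": "Age (years)", "hxl": "#beneficiary+age"},
    {"type": "date", "name": "treatment_start", "label": "Treatment start date", "hxl": "#date+start"},
    {"type": "text", "name": "notes", "label": "Clinical notes", "hxl": ""},
]


def survey_sheet_as_csv(rows):
    """Serialize the survey sheet to CSV text, one row per form field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["type", "name", "label", "hxl"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def hxl_index(rows):
    """Map each annotated field name to its hashtag, skipping empty cells."""
    return {row["name"]: row["hxl"] for row in rows if row.get("hxl")}
```

The index returned by `hxl_index` is the kind of lookup that could later be used to link datasets annotated with the same hashtags.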

Results
The framework was developed using the PHP v7.4 scripting language [24] and is composed of five modules, as follows: i) a metadata database and an Admin System; ii) a form converter; iii) an ETL (extract-transform-load) processor; iv) a data quality module; and v) the Ontology Services. Figure 1 shows the REDbox framework overview.

The metadata database and the Admin System
The web-based Admin System was developed in the C# [25] and JavaScript [26] programming languages to manage the mandatory metadata through create, read, update, and delete (CRUD) operations. Figure 2 presents the relational database model.
In general, first an entry for a REDCap project must be created (table redcap_project), including the Application Programming Interface (API) parameters, and then each of the project's instruments must be registered (table redcap_forms).
To initiate a form conversion, the user must upload the spreadsheet (.xls) or the ontology (.owl) file, fill in the form name, and choose between generating a .zip file, to be uploaded manually into REDCap, or automatically importing the form through the API. For the second option, the API token and URL must be provided. Figure 3 shows the user interface of the converter.
Deriving from ontologies. Each property of a given ontology can be converted to a field in a form. The name and type of a field are obtained from the name of the property and the associated type (text is the default type). Minimum and maximum values defined as restrictions on properties are also converted.
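The derivation described above can be sketched as follows. This is a minimal Python illustration, not the converter's actual code: the property list stands in for an already parsed .owl file, and the datatype mapping and the helper names `to_snake_case` and `property_to_field` are hypothetical. The output keys follow the column names of a REDCap data dictionary.

```python
import re

# Hypothetical, simplified view of ontology properties after parsing an
# .owl file; the parsing itself is out of scope for this sketch.
PROPERTIES = [
    {"name": "treatmentStartDate", "datatype": "date"},
    {"name": "weightKg", "datatype": "decimal", "min": 1, "max": 300},
    {"name": "clinicalNotes"},  # no datatype given: falls back to text
]

# Assumed mapping from property datatypes to REDCap text validation types.
VALIDATION_BY_DATATYPE = {"date": "date_ymd", "integer": "integer", "decimal": "number"}


def to_snake_case(name):
    """Turn a camelCase property name into a lower-case variable name."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def property_to_field(prop, form_name):
    """Build one data dictionary row from one ontology property."""
    datatype = prop.get("datatype", "text")  # text is the default type
    return {
        "field_name": to_snake_case(prop["name"]),
        "form_name": form_name,
        "field_type": "text",
        "field_label": prop["name"],
        "text_validation_type_or_show_slider_number": VALIDATION_BY_DATATYPE.get(datatype, ""),
        "text_validation_min": str(prop.get("min", "")),
        "text_validation_max": str(prop.get("max", "")),
    }
```

Restrictions on a property become the minimum and maximum validation values of the field, while a property without a datatype yields a plain text field.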
Converting from XLSForms. The converter supports all common field types, such as: text, date, date and time, time, integer, decimal, calculation, single selection, multiple selection, files and notes. These types of fields are converted as they are, including the variable name and the values assigned to options in single and multiple selections, so instruments will have a matching structure on both systems. Skip logic defined in KoBoToolbox is translated to REDCap branching logic, as are validation rules.
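A minimal Python sketch of this translation is given below, under stated assumptions: the type table covers only a subset of the supported types, the function names are hypothetical, and only two skip logic patterns are handled (field references and the `selected()` test on multiple selection questions). It is not the converter's actual implementation.

```python
import re

# Assumed subset of the type mapping: XLSForm type -> (REDCap field type,
# REDCap text validation type).
TYPE_MAP = {
    "text": ("text", ""),
    "integer": ("text", "integer"),
    "decimal": ("text", "number"),
    "date": ("text", "date_ymd"),
    "select_one": ("radio", ""),
    "select_multiple": ("checkbox", ""),
    "calculate": ("calc", ""),
    "note": ("descriptive", ""),
}


def convert_type(xlsform_type):
    """Look up the REDCap type, ignoring the choice list name if present."""
    base = xlsform_type.split()[0]  # "select_one yes_no" -> "select_one"
    return TYPE_MAP[base]


def relevant_to_branching(relevant):
    """Translate an XLSForm 'relevant' expression to REDCap branching logic."""
    # selected(${symptoms}, 'cough') -> [symptoms(cough)] = '1'
    logic = re.sub(
        r"selected\(\s*\$\{(\w+)\}\s*,\s*'([^']+)'\s*\)", r"[\1(\2)] = '1'", relevant
    )
    # ${age} -> [age]
    return re.sub(r"\$\{(\w+)\}", r"[\1]", logic)
```

For example, `relevant_to_branching("${age} >= 18 and selected(${checkbox_symptoms}, 'cough')")` returns `[age] >= 18 and [checkbox_symptoms(cough)] = '1'`.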
In the design process, there is a particularity related to multiple selection questions (checkboxes). The field name of this type of question must start with 'checkbox_'. This is needed to ensure that the structure of a multiple selection question is correctly identified during data transfer from KoBoToolbox to REDCap.
Before starting the conversion process, the converter module checks the naming convention. If any inconsistency is detected, the conversion fails and the user is informed of the detected error.
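The pre-check can be pictured with the short Python sketch below. It assumes the survey sheet has already been read into a list of rows with `type` and `name` keys; the function name and the error message wording are hypothetical.

```python
def check_checkbox_naming(survey_rows):
    """Return one error message per multiple selection field that does not
    follow the 'checkbox_' naming convention."""
    errors = []
    for row in survey_rows:
        is_multiple = row["type"].split()[0] == "select_multiple"
        if is_multiple and not row["name"].startswith("checkbox_"):
            errors.append(
                f"{row['name']}: multiple selection fields must start with 'checkbox_'"
            )
    return errors


# Hypothetical survey rows: the second one violates the convention.
ROWS = [
    {"type": "select_multiple symptoms", "name": "checkbox_symptoms"},
    {"type": "select_multiple drugs", "name": "drugs_taken"},
    {"type": "integer", "name": "age"},
]
```

A non-empty result would abort the conversion, so the designer can rename the field before any data is collected.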

The ETL processor
After converting the instrument and transmitting it to REDCap, KoBoToolbox native REST Services must be enabled in the form settings to instantly submit collected data to the ETL processor through a POST request. The processor URL and basic HTTP authentication credentials must be provided.
The processor receives the data collected in KoBoToolbox as a JSON object, which is parsed to remove unnecessary elements that are not related to the data of interest. After verifying the authentication credentials, the metadata is queried to obtain the URL and the token of the REDCap API (table redcap_projects) and to verify whether it is the first form in the project (table redcap_forms). If it is, a request is sent to the REDCap API to generate a new record ID, which means that it is a new participant in a research project. Otherwise, the record ID is searched for in the log of collected data, based on the participant identifier. Then, a request is sent to the REDCap API to import the data.
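The transform step and the record ID decision described above might look like the following Python sketch. The shape of the submission (keys starting with `_`, `meta/` or `formhub/`, group paths separated by `/`, space-separated multiple selection values) is an assumption about the KoBoToolbox payload, the `field___code` layout is the flat checkbox format used by REDCap imports, and all function names are hypothetical.

```python
# Assumed prefixes of submission keys that carry no research data.
META_PREFIXES = ("_", "meta/", "formhub/")


def clean_submission(submission):
    """Drop metadata keys, strip group paths and expand checkbox fields."""
    record = {}
    for key, value in submission.items():
        if key.startswith(META_PREFIXES):
            continue
        name = key.split("/")[-1]  # "demographics/age" -> "age"
        if name.startswith("checkbox_"):
            # "cough fever" -> checkbox_x___cough = 1, checkbox_x___fever = 1
            for code in str(value).split():
                record[f"{name}___{code}"] = "1"
        else:
            record[name] = value
    return record


def resolve_record_id(participant_id, is_first_form, record_log, next_record_id):
    """New participants get a fresh ID; others are looked up in the log."""
    if is_first_form:
        record_id = next_record_id()  # stands in for the REDCap API call
        record_log[participant_id] = record_id
        return record_id
    return record_log[participant_id]
```

The 'checkbox_' prefix is what lets `clean_submission` recognize a multiple selection answer without consulting the form definition, which is why the naming convention matters for the data transfer.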
After successfully saving the data, additional steps may take place depending on the settings defined for the instrument, such as: sending e-mail notifications (both to the respondent and to the research team), checking for duplicate records, and instantly locking the saved record (to avoid changes in the data).
These are useful features that may facilitate the management of research data.
Once the data is in the REDCap database, records are monitored through the Data Entry Trigger module, which can detect any changes. When a change occurs, the processor exports the edited data from REDCap and logs it into the relational database.

Data Quality Module
Data management is a continuous process and represents a critical phase in clinical research, due to its importance to the generation of high-quality and reliable data for statistical analysis, which should meet the protocol-specified parameters and comply with the research protocol requirements [27].
It is crucial that the management activities occur in parallel with the data collection. The data manager usually carries out a data validation process, which includes the verification of the consistency, completeness and accuracy of collected data. This is expected to reduce missing data and increase data quality.
In health research, most data are acquired during participants' visits. Therefore, keeping track of the schedule of visits and their status (carried out, not carried out, pending) is essential for not missing any milestone.
However, all of these tasks are time-consuming, because they demand a careful inspection of a significant amount of data. The REDCap software natively offers useful tools to help data managers and researchers, such as the Resolution Workflow and the Scheduling features. The former allows queries to be opened to request the verification of collected data, and the latter assists in the scheduling of expected visits for participants during the study (although it requires a manual setup for each participant).
The Data Quality Module is composed of two functionalities that can complement the ones offered by REDCap, focusing on the reduction of the workload for data managers and researchers.
First, there is an automatic rule-based validation procedure that goes through each field in all instruments searching for any inconsistency. Rules must be pre-defined as metadata and they represent the format or range of values expected for a given field. The procedure runs several times a day to check, at the same time, for new issues and to verify the resolution of previously identified ones. When an issue is detected, a query is opened in the Resolution Workflow (in REDCap) and the data collector is alerted by e-mail. Figure 4 presents the dashboard with an overview of all issues detected in a REDCap project.
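A minimal Python sketch of such a rule-based check follows. The rule format (a `pattern`, a `min` and a `max` per field), the field names and the issue labels are assumptions made for illustration; opening the query in REDCap and sending the e-mail are left out.

```python
import re

# Hypothetical rules stored as metadata: expected format or range per field.
RULES = {
    "age": {"min": 0, "max": 120},
    "visit_date": {"pattern": r"\d{4}-\d{2}-\d{2}"},
}


def violation(value, rule):
    """Return a short description of the problem, or None if the value is fine."""
    if "pattern" in rule and not re.fullmatch(rule["pattern"], str(value)):
        return "unexpected format"
    if "min" in rule or "max" in rule:
        try:
            number = float(value)
        except (TypeError, ValueError):
            return "not a number"
        if "min" in rule and number < rule["min"]:
            return "below minimum"
        if "max" in rule and number > rule["max"]:
            return "above maximum"
    return None


def find_issues(records, rules):
    """Check every ruled field of every record; empty values are skipped here."""
    issues = []
    for record in records:
        for field, rule in rules.items():
            value = record.get(field, "")
            if value == "":
                continue
            problem = violation(value, rule)
            if problem:
                issues.append((record["record_id"], field, problem))
    return issues


def resolved_issues(previous, current):
    """Issues found in an earlier run that no longer appear in this one."""
    return sorted(set(previous) - set(current))
```

Running `find_issues` on each scheduled pass and comparing the result with the previous pass through `resolved_issues` covers both halves of the procedure: detecting new issues and confirming that earlier ones were fixed.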
Additionally, a panel was developed to provide a quick visualization of all upcoming participants' visits. Each row in the panel is a participant and each column a visit. The color of a cell represents the status of a visit (green: carried out; red: not carried out; yellow: pending/waiting for the participant). Dates are calculated based on a reference date field (e.g., the day of an intervention or inclusion in the study) and on the offset in days defined for each event. This information is also stored as metadata.
The panel is created in real time with online data extracted from the REDCap database, saving time for researchers, who usually create their own panels using spreadsheets. Figure 5 shows the panel for a study with 21 visits (project IV in Table 2).
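The date calculation behind the panel can be sketched in Python as below. The visit names and offsets are hypothetical, and the rule that an overdue visit is shown as red ("not carried out") is an assumption about how the status is derived.

```python
from datetime import date, timedelta

# Hypothetical metadata: days between the reference date and each visit.
OFFSETS = {"visit_1": 0, "visit_2": 30, "visit_3": 90}


def visit_status(due, done, today):
    """Green if carried out, red if overdue (assumed rule), yellow if pending."""
    if done:
        return "green"
    if due < today:
        return "red"
    return "yellow"


def panel_row(reference_date, completed, today):
    """Build one panel row: the due date and cell color for each visit."""
    row = {}
    for visit, offset in OFFSETS.items():
        due = reference_date + timedelta(days=offset)
        row[visit] = (due, visit_status(due, visit in completed, today))
    return row
```

Each participant contributes one such row, so the whole panel is a list of rows computed from the reference date field and the set of visits already recorded.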

Ontology Service
The solution offers a service that provides practical tools to enhance the use of ontologies in the system and to allow the continuous integration of different data sources. The service is able to adapt to the evolution of ontologies, ensuring data availability and avoiding data loss.
As previously stated, the form converter is able to derive an instrument from an ontology. In a similar way, this service enables the creation of an ontology based on an instrument. This feature relies on an external application, namely the D2R Server [28,29]. D2R is a tool that converts relational content into semantic formats, allowing a quick conversion between these formats by automatically creating ontologies based on the schema of the content.
Relying on this feature, REDbox can define an ontology from a data collection instrument. To achieve this, a temporary table is created in a relational database, where each column represents a field in the instrument. Then, D2R generates and publishes an ontology using the table structure, i.e., converting columns to properties, which can be customized later. Table 1 presents an example of an ontology generated from an instrument containing patients' treatment data.
The Ontology Service guarantees semantic interoperability between applications and forms that use different versions of the same ontology, or even different ontologies, by maintaining the history of changes and mapping the concepts from one ontology version to another. The service offers the following features: uploading a file containing the source terms of one ontology version and the corresponding target terms in the new ontology version; uploading files annotated with one ontology version and converting them to an older or newer version of the same ontology; and uploading a file marked up with one ontology and converting it to a file marked up with a correlated ontology that was previously aligned or mapped.
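The version conversion can be illustrated with the Python sketch below. The term names, the mapping and the function names are hypothetical, and the policy of keeping unmapped terms unchanged while reporting them is an assumption rather than the documented behavior of the Ontology Service.

```python
# Hypothetical mapping file content: source term in version 1 -> target
# term in version 2 of the same ontology.
V1_TO_V2 = {
    "tb:treatmentStart": "tb:treatmentStartDate",
    "tb:weight": "tb:bodyWeightKg",
}


def convert_annotations(annotations, mapping):
    """Re-annotate fields with the target ontology version.

    Terms without a mapping are kept unchanged and reported (assumed policy).
    """
    converted, unmapped = {}, []
    for field, term in annotations.items():
        if term in mapping:
            converted[field] = mapping[term]
        else:
            converted[field] = term
            unmapped.append(field)
    return converted, unmapped


def invert(mapping):
    """Build the newer-to-older mapping; fails if two terms were merged."""
    inverse = {}
    for source, target in mapping.items():
        if target in inverse:
            raise ValueError(f"{target} has more than one source term")
        inverse[target] = source
    return inverse
```

Keeping the mapping as data, rather than rewriting annotations by hand, is what allows a file to be converted to either an older or a newer version: the backward direction is obtained from `invert` when the mapping is one-to-one.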

Validation
The proposed solution is validated through its use in several cross-institutional research projects related to TB in Brazil, namely: i) Longitudinal Study of the Impact of Social Support on Tuberculosis Indicators - ELISIOS; ii) Validation of the Line Probe Assay's performance as a rapid diagnosis method for drug-resistant tuberculosis in reference centers in Brazil; iii) Validation of Recombinant PPD in the Diagnosis of Tuberculosis Infection; and iv) ProBCG - Use of the Bacillus Calmette-Guérin (BCG) vaccine as prevention of COVID-19 in health professionals. Table 2 shows the characteristics of each project currently using the framework. There is a significant number of instruments and fields in each project. The form converter module is therefore crucial in this scenario, where each form needs to be designed only once in KoBoToolbox and then converted to the REDCap format. The expected number of records is also significant, which may demand the use of easy-to-use and offline tools.

Discussion
The relevance of this article lies in the innovative approach to support TB research during collection and management phases, which is often carried out in contexts with few human and technological resources. These phases can be improved through the REDbox framework, which offers useful tools and a better user experience based on the integration of the REDCap and KoBoToolbox EDC systems and the use of semantics.
Although REDCap presents a better approach to the whole research life cycle, some usability concerns and limited offline availability could be a significant drawback. REDCap has a mobile app that enables offline data collection, but it may not be enough, due to the dependency on the availability of smartphones and/or tablets in research centers, the poor usability provided, and the incompatibility of some advanced features [30]. Also, mobile devices in digital data collection projects are frequently not owned by the people entering the data, which can be considered a risk to be managed [31].
On the other hand, KoBoToolbox natively works directly from a mobile browser. Due to the use of HTML5 features, KoBoToolbox provides a better user experience through modern form styles and a way to work offline if needed, without the use of any additional application, such as a mobile app.
Semantics. The semantic annotation can underpin the exchange, use and integration of data from different sources thanks to the aggregation of meaning in raw data. In other words, data becomes machine understandable and can be interpreted by distinct systems.
In research project IV, as shown in Table 2, a semantic integration has been performed using data collected in the research instruments and in health information systems from the Brazilian Ministry of Health. In this case, demographic and vaccination information were integrated and compared to keep data up to date and increase completeness in the research dataset.
APIs. APIs enable interoperability and data integration between software components and the development of extensions of existing systems.
Regarding REDCap, the API is well documented, and several endpoints are available, which essentially allow a whole project to be managed programmatically. In this work, some endpoints were used, specifically to: i) import and export data; ii) import files; iii) generate unique identifiers (record ID); iv) import metadata (instruments, fields); and v) export metadata.
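For illustration, the request bodies for three of these operations can be sketched in Python as follows. The helper names are hypothetical, the token is a placeholder, and the parameter names (`content`, `format`, `type`, `data`) reflect the REDCap API as commonly documented; they should be verified against the API documentation of the local REDCap installation. No request is actually sent.

```python
import json


def import_records_payload(token, records):
    """Form fields for importing flat records as JSON."""
    return {
        "token": token,
        "content": "record",
        "format": "json",
        "type": "flat",
        "data": json.dumps(records),
    }


def export_metadata_payload(token):
    """Form fields for exporting the data dictionary (instruments, fields)."""
    return {"token": token, "content": "metadata", "format": "json"}


def next_record_name_payload(token):
    """Form fields for asking REDCap for the next unused record ID."""
    return {"token": token, "content": "generateNextRecordName"}
```

Each dictionary would be sent as the form-encoded body of a POST request to the project's API URL, which is why the URL and token are stored as metadata for every registered project.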
In KoBoToolbox, the API is not adequately documented. However, a feature is available to instantly send collected data to an external server in JSON format. This is very useful when using the system only for data collection, which is the intention of this work, because it eliminates the need to develop a client to extract data.
Data safety. In general, data is stored in three distinct logical units, namely the KoBoToolbox database, the REDCap database, and the relational database. Only data stored in REDCap is intended for analysis, but in case of any failure, data can be easily restored. Finally, the whole process is transparent to the final user, who can focus only on data collection, management, and analysis.
Limitations. In the form converter, the designer must pay attention to the following aspects:
i. Naming convention for multiple selection fields (checkboxes). Using the naming convention for the variables of multiple selection fields is crucial; otherwise, data transfer may fail.
ii. Calculated fields. Unlike REDCap, KoBoToolbox does not allow a label to be set for calculated fields. As a workaround, the designer can use the "Guidance Hint" option, which will be transformed into a label when converted to the REDCap format. However, this is not mandatory, since REDCap accepts blank labels in calculated fields.

Conclusions
This work has presented REDbox, a comprehensive framework for integrated data collection and management in tuberculosis research. The use of REDCap and KoBoToolbox together has allowed the advantages of each one to be combined in a transparent way, helping researchers to manage and maintain data while increasing the satisfaction of the final users who are responsible for collecting data in the field. Furthermore, the Data Quality module is intended to speed up and enhance data management by reducing the workload of time-consuming and delicate tasks.
Supporting the semantic integration of data is also another important contribution of this work. The addition of meaning in raw data and the possibility to follow the evolution of ontologies through versioning are crucial to promote the quality and the availability of research data over time.
Finally, although the solution was motivated by the TB scenario, it is applicable in other health fields.
Availability of data and materials: Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
[Figure: Data Quality Module - Visits panel]