A Case Study of the Development of a Valid and Pragmatic Implementation Science Measure: The Barriers and Facilitators in Implementation of Task-Sharing Mental Health Interventions (BeFITS-MH) Measure

Background: Few implementation science (IS) measures have been evaluated for validity, reliability and utility – the latter referring to whether a measure captures meaningful aspects of implementation contexts. In this case study, we describe the process of developing an IS measure that aims to assess Barriers and Facilitators in Implementation of Task-Sharing in Mental Health services (BeFITS-MH), and the procedures we implemented to enhance its utility. Methods: We summarize conceptual and empirical work that informed the development of the BeFITS-MH measure, including a description of the Delphi process, detailed translation and local adaptation procedures, and concurrent pilot testing. As validity and reliability are key aspects of measure development, we also report on our process of assessing the measure’s construct validity and utility for the implementation outcomes of acceptability, appropriateness, and feasibility. Results: Continuous stakeholder involvement and concurrent pilot testing resulted in several adaptations of the BeFITS-MH measure’s structure, scaling, and format to enhance contextual relevance and utility. Adaptations of broad terms such as “program,” “provider type,” and “type of service” were necessary due to the heterogeneous nature of interventions, type of task-sharing providers employed, and clients served across the three global sites. Item selection benefited from the iterative process, enabling identification of relevance of key aspects of identified barriers and facilitators, and what aspects were common across sites. Program implementers’ conceptions of utility regarding the measure’s acceptability, appropriateness, and feasibility were seen to cluster across several common categories. Conclusions: This case study provides a rigorous, multi-step process for developing a pragmatic IS measure. The process and lessons learned will aid in the teaching, practice and research of IS measurement development. 
The importance of including experiences and knowledge from different types of stakeholders in different global settings was reinforced and resulted in a more globally useful measure while allowing for locally-relevant adaptation. To increase the relevance of the measure it is important to target actionable domains that predict markers of utility (e.g., successful uptake) per program implementers’ preferences. With this case study, we provide a detailed roadmap for others seeking to develop and validate IS measures that maximize local utility and impact.


BACKGROUND
Most implementation science (IS) measurement development has been done in high-income health system contexts such as the US, UK, Australia, and select European countries (1). Because of this limited contextual focus, current IS measures tend to be less applicable in low- and middle-income countries (LMICs) with different cultural contexts and health and economic systems. Among the key differences in health care systems between high-, middle-, and low-income countries are the role of insurance and payment mechanisms (2). For mental health care specifically, the limited availability of secondary and tertiary mental health care facilities in LMICs (3) has resulted in a greater reliance on non-specialist mental health providers (e.g., community health workers, peers) (4). Although there has been some growth in IS measure development for use in LMICs (5), the widespread use of measures developed specifically for these contexts, as well as pragmatic examples of the process of developing such IS measures, remain limited.
Standards exist for rigorous measure development and evaluation. Key criteria include defining the concepts of interest (i.e., constructs) based on relevant theory (known as "content validity") and conducting appropriate analytic tests to assess reliability (i.e., whether measures are consistent) and validity (i.e., whether measures assess what they propose to measure) (6). Many IS measures have been limited by a lack of clarity in theory or conceptual frameworks and heterogeneity in operationalization of relevant concepts. Illuminating this gap, a review found that the majority of IS measures, in addition to showing insufficient content validity, either did not provide sufficient information about, or were unsatisfactory in, multiple psychometric properties (7). In addition, rich and detailed descriptions of the process by which IS measures capture information that is relevant to implementation processes in global contexts also remain lacking. Furthermore, few IS measures have been fully evaluated in terms of their pragmatic properties. According to Glasgow and Riley (8), important criteria for pragmatic measures include, among others: important to stakeholders, low respondent burden, actionable, sensitive to change, broadly applicable, and can serve as a benchmark. Efforts to establish criteria to evaluate pragmatic properties of IS measures have yielded substantial conceptual clarity and are pushing the field of IS measurement development forward to achieve greater scientific rigor and practical impact (7, 9-11). Nevertheless, still largely missing in the literature is a detailed account of the process of developing and validating a pragmatic IS measure, including how stakeholders such as program implementers are engaged to enhance the measure's utility, a key property defined as whether a measure and its items account for the meaningful aspects of the implementation contexts (e.g., cultural relevance, environmental resources, and program processes).

A Lack of IS Measures for Task-Sharing Mental Health Services
Currently there is a global push for the scale up and integration of mental health services to reduce the mental health treatment gap worldwide. The 2018 Lancet Commission on Mental Health and Sustainable Development Goals (12), Grand Challenges in Global Mental Health (13), and several systematic reviews (4, 14-21) all strongly advocate that effective implementation of task-sharing strategies can help narrow the treatment gap that is particularly prominent in LMICs. Task-sharing involves the formalized redistribution of care typically provided by those with more specialized training (e.g., psychiatrists, psychologists) to individuals, often in the community, with little or no formal training (e.g., community/lay health workers, peer support workers) (22). A growing number of efficacious task-sharing mental health interventions exist and can take diverse forms, including but not limited to: utilizing primary care workers to detect and/or deliver mental health care (23-25); training community health workers to administer psychotherapy interventions for people with common mental disorders (23,26); and using community-based workers or peers to provide access to medications and rehabilitation services for people with serious mental illness (27,28).
Despite the expanding evidence base, we lack a robust understanding of the barriers and facilitators that contribute to implementation success and what these look like across diverse task-sharing mental health interventions and contexts, which is needed to fulfill the promise of task-sharing in addressing the mental health treatment gap (29). The lack of valid and pragmatic IS measures to identify these barriers and facilitators (i.e., 'implementation determinants') (30) (32); and (III) integration of mental health services alongside stigma reduction in primary care in Nepal (Reducing Stigma among Healthcare Providers [RESHAPE]) (33). Table 1 presents further information about the task-sharing mental health interventions being implemented in each site. Specifically, this case study illustrates how to develop a measure that has contextual relevance and utility across diverse task-sharing mental health programs and settings, and how to engage stakeholders in assessing the construct validity and pragmatic utility of implementation outcomes measures.

Process of Developing the BeFITS-MH Measure
We undertook an extensive process to develop the BeFITS-MH measure. First, we developed a multi-level conceptual model to guide our understanding of the domains of barriers and facilitators associated with task-sharing mental health interventions. Second, we further specified the conceptual model using two data sources: the Shared Research Projects (below) and a systematic review. Based on the results of this model building process, we constructed the initial draft of the BeFITS-MH measure. The measure was revised through expert feedback from a Delphi panel. Further refinements of the BeFITS-MH measure were done during the translation and local adaptation stage. Finally, we conducted concurrent pilot testing procedures to finalize the BeFITS-MH measure within the three study site programs.
Theory-driven and empirically-grounded measure development.
In developing our theoretical model, we selected the Consolidated Framework for Implementation Research (CFIR) (35,36) and the Theoretical Domains Framework (TDF) (37,38), which together allowed us to enumerate and categorize a wide range of potential implementation determinants (i.e., 'barriers and facilitators') (39). In addition, we also drew on Chaudoir et al.'s framework (10).
Delphi process.
To refine the BeFITS-MH measure and to arrive at an expert group consensus on the measure's core initial domains, format, and structure, we conducted what is known as a 'modified Delphi process'. Our modifications reflected group sessions that provided opportunities to discuss differences in responses. First, we assembled a 'Dissemination Panel' of 19 global experts in implementation of task-sharing mental health interventions and health services research, particularly in LMIC settings, including the study co-investigators at the three sites (South Africa, Chile, Nepal). Over a period of five months, the panel met in three virtual forums (2 hours each), interspersed with two rounds of online questionnaires where panel members were asked to individually provide feedback about different aspects of the BeFITS-MH measure (e.g., the construct and content validity of the domains [subscales], cultural and linguistic appropriateness of the items, hypothesized relationships of subscales to implementation outcomes). All questionnaire responses were compiled and discussed at the following virtual forum.
Field based translation and local adaptations.
Following the Delphi process, we held regular biweekly virtual meetings with the lead BeFITS-MH measure developers and co-investigators from each of the three study sites to translate and locally adapt the BeFITS-MH measure. Within each site, we opted for a group translation process, wherein 2-3 local staff (researchers, clinicians, task-sharing providers, and program implementers) were consulted to jointly translate the measure. This collaborative process has been identified as particularly important for mental health problems and programs, where assessments of emotions and behaviors need to be aligned with local understanding and conceptualizations (43,44). Along with the translations, site-specific adaptations included using appropriate terms describing the target mental health problem and task-sharing mental health intervention being implemented within each setting. For example, each site provided project-specific terms used for the [task-sharing] 'provider' (e.g., 'counselor' in South Africa, 'team member' in Chile, and 'primary health care worker' in Nepal). Notes regarding how each item was translated and all site-specific adaptations were recorded and discussed during regular biweekly virtual meetings to harmonize the measure across sites and to preserve comprehensiveness of item content (i.e., content validity) to the extent possible.
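The site-specific substitution of provider terms described above can be sketched in a few lines of Python. This is a minimal illustration only, not the study team's tooling: the function name `localize`, the `SITE_TERMS` mapping, and the bracketed-placeholder convention are assumptions made for the example, and any item wording passed to it is hypothetical rather than an actual BeFITS-MH item. The three provider terms are those reported for each site.

```python
import re

# Provider terms reported for each site; the dictionary layout itself is
# an assumption made for this sketch.
SITE_TERMS = {
    "South Africa": {"provider": "counselor"},
    "Chile": {"provider": "team member"},
    "Nepal": {"provider": "primary health care worker"},
}


def localize(template: str, terms: dict) -> str:
    """Replace each [placeholder] in an item template with a site term.

    Raises KeyError when a placeholder has no site-specific term, so that
    an unadapted item cannot silently reach respondents.
    """
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in terms:
            raise KeyError(f"no site-specific term for placeholder [{key}]")
        return terms[key]

    # Match any text enclosed in a single pair of square brackets.
    return re.sub(r"\[([^\[\]]+)\]", substitute, template)
```

Failing loudly on an unknown placeholder is a deliberate choice in this sketch: it mirrors the harmonization step, in which every adapted term was recorded and reviewed across sites.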
Pilot testing.
Piloting of the translated and adapted BeFITS-MH measure was conducted concurrently across the three sites with providers (South Africa 4; Chile 5; Nepal 35) and service users (South Africa 10; Chile 5; Nepal 6). As part of the piloting process, cognitive interviews were conducted with respondents who were asked to "think aloud" while responding to each item, and to comment on whether items were worded in an understandable way. We further asked whether items were applicable to the specific task-sharing mental health program being implemented and the local setting; this enabled identification of whether the full range of identified barriers and facilitators were used, and in triangulating responses across the three sites, what aspects of barriers and facilitators could be considered core to task-sharing across sites. We also gathered feedback in terms of the project's preference for the format of the measure (question vs. statement) and the scaling used. We discussed the findings during the biweekly virtual meetings, noting site-specific findings as well as commonalities across sites.

Process of Enhancing Utility of the BeFITS-MH measure: Assessing Associations with Implementation Outcomes
To support later BeFITS-MH validation testing, we describe the process of enhancing the construct validity and utility of the BeFITS-MH measure in assessing three implementation outcomes of interest: acceptability, appropriateness, and feasibility. We did this by pilot testing three brief measures that have been previously used in implementation science research (below) and through stakeholder discussions in each site.
Standard measures of implementation outcomes.
The three selected measures were the: a) Acceptability of Intervention Measure (AIM); b) Intervention Appropriateness Measure (IAM); and c) Feasibility of Intervention Measure (FIM) (45). These measures were developed by IS researchers and mental health professionals in the United States, with the vast majority of the developers and the sample of counselors who were part of the development process being Caucasian Americans. The three measures were initially developed for use with mental health counselors in the United States to evaluate the acceptability, appropriateness and feasibility of different treatment options (45) (54). In addition to planning to use these three measures with the task-sharing providers, we explored the potential for using them with the clients and patients who were receiving the task-sharing mental health interventions. Field testing of these three measures with providers and service users was conducted concurrently with the pilot testing of the BeFITS-MH measure (above). After site-specific translation, we made one adaptation to the measures: replacing the term 'EBP' with the name of the specific task-sharing program implemented at each site. We then administered the measures to samples of task-sharing providers and service users.
Stakeholder feedback sessions.
To gain a better understanding from the mental health practitioner and system perspectives regarding the implementation outcomes, we held discussions with local staff in each of the three study countries. These individual and small group conversations were led by site co-investigators using a standard script that included definitions of acceptability, appropriateness, and feasibility, and probes for level-specific indicators for clients, providers, and the setting (Table 2). By indicators we are referring to individual items or programmatic and clinical metrics (like cases seen per month) that can be included in the measure and that are directly related to the measurement of the implementation outcomes. After an introduction of the definitions of the three implementation outcomes, the probes asked the participants to suggest how they think we could best measure these outcomes from the perspective of the people 'receiving' the program, the people 'delivering' the program, and the locale where the program is being provided. For example: "What are your ideas for how we could measure whether the place where the program is provided is acceptable?"
In Nepal, two small group discussions were held: the first with 3 participants (1 medical officer and 2 senior auxiliary health workers) and the second with 7 participants (4 psychosocial counselors and 3 health systems research staff). In South Africa, one small group discussion was held with 3 program staff (program monitoring and evaluation staff and program implementers). In Chile, information was collected by direct interview of 4 mental health professionals (1 psychologist, 1 occupational therapist, 1 social worker, 1 nurse) working at mental health centers where the task-sharing program is being implemented.
The discussions were transcribed and shared in English (for Nepal and Chile) with the full study team. Transcripts were reviewed and coded for (I) each of the three implementation outcomes and (II) each perspective (client, provider, system) by the study PIs (LHY, JB) and key study team members (PTL, MG). Results were reviewed to identify commonalities and common indicators, with a particular eye towards suggesting where differences may be driven by the distinct type of task-sharing program being implemented. Summaries of the stakeholder perspectives around each implementation outcome are presented in the results; recommended programmatic and clinical indicators that can be used for future formal validation testing for BeFITS-MH to enhance its utility are addressed in the Discussion.

RESULTS
We present major developments and findings of the case study according to our two foci: (I) learnings on the concurrent development of the BeFITS-MH measure in three LMIC settings and (II) learnings on the identification and assessment of key implementation outcomes (acceptability, appropriateness, feasibility) to enhance the utility of later BeFITS-MH validity testing.

Process of Developing the BeFITS-MH Measure
The Delphi process resulted in three major adaptations to the BeFITS-MH measure. First, the expert panel agreed that we add a summary or "omnibus" item to each subscale, which is intended to capture the domain's core concept (i.e., construct). If the omnibus items correlate strongly with the other items within each domain and meaningfully predict implementation outcomes, the omnibus items could potentially be used on their own, reducing the length of the measure and increasing its pragmatic utility. Second, we added a domain on stigma. During the Delphi group discussions, several members highlighted the salience of stigma in task-sharing mental health programs in LMICs and the lack of existing measures to capture stigma-related barriers and facilitators. Three study investigators (LHY, BK, PTL) worked with other experts to develop items for the stigma subscale, which included items that assessed: (a) attitudes of the clients, (b) attitudes about the clients, and (c) provider's (stigma) experience. Third, the panel came to a consensus to make some of the items optional, a process that we continued in the following steps (below). This was an effort to enhance contextual relevance and to reduce respondent burden. We recognized that some factors, such as cultural/ethnic/caste backgrounds (Item 4.6), are not relevant in certain projects or settings (below). Additionally, there are constructs that are important to assess in the implementation of the intervention in general but were identified as not being specific to the task-sharing strategy itself; items that fell into this grouping were consolidated into a domain called 'Program Fit'. The study team agreed to make the 'Program Fit' and 'Stigma' domains optional.
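The omnibus-item rationale above rests on a correlation check that can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's analysis plan: it uses a plain Pearson correlation between each respondent's omnibus score and the mean of that respondent's remaining items in the same domain, and the function names `pearson` and `omnibus_rest_correlation` are hypothetical. The source does not specify which statistic or threshold will be used.

```python
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


def omnibus_rest_correlation(omnibus: list, other_items: list) -> float:
    """Correlate the omnibus item with the mean of the other domain items.

    omnibus:      one omnibus score per respondent
    other_items:  one list of remaining item scores per respondent
    (Assumption: responses are already restricted to substantive codes.)
    """
    rest_means = [sum(scores) / len(scores) for scores in other_items]
    return pearson(omnibus, rest_means)
```

A high value from `omnibus_rest_correlation` would be only one of the two conditions named in the text; the omnibus items would also need to predict implementation outcomes before being used on their own.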
Linguistic translation and cultural and contextual adaptation, along with the pilot testing procedures, each of which took place concurrently in the three sites, led to several important adaptations of the BeFITS-MH measure. We highlight three findings, regarding: (I) localization; (II) scaling and phrasing; and (III) item selection (i.e., rating of items' relevance/applicability by site).
A key element of the BeFITS-MH measure was the project-specific adaptation (i.e., "localization") of broad terms used to refer to aspects of the task-sharing mental health intervention, such as "program," "provider type," and "type of service." This adaptation was necessary due to the heterogeneous nature of the task-sharing mental health interventions, the type of task-sharing providers employed, and the clients served across the three global sites. We included an introductory statement at the beginning of the measure to situate the respondent to the context of their task-sharing mental health intervention. The terms in square brackets ([ ]) were replaced by project-specific terms (see
The second notable finding of our measure development process concerned the scaling and the phrasing of items as questions rather than statements. Although the measure was originally designed as statements, we found in piloting that phrasing items as questions was easier for both providers and clients to understand (e.g., we changed "Clients are satisfied with services…," to "How satisfied are clients with services…?"). Study team members in Nepal and South Africa reported that this decreased social desirability bias, reducing the risk of respondents providing affirmative responses across items. In Chile, a high-income country with a 94.6% literacy rate compared to Nepal with a 68% literacy rate, respondents were comfortable with the statement format of items, which they commonly encounter in formal education. However, to make the measure as universally usable as possible, including in LMICs, we decided to use the question format. Based on our biweekly discussions, we then selected a 4-point response scale, which was agreed to be easiest to understand and code: 0 = Not at all; 1 = A little; 2 = A moderate amount; 3 = A lot. To support accurate coding, we included three additional options that could be used when assessments were administered by assessors (rather than self-report): 7 = Respondent refused to answer; 8 = Respondent doesn't know; and 9 = Not applicable.
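The response coding above implies that codes 7, 8, and 9 must be kept out of any numeric summary. A minimal Python sketch of that handling follows. The code labels come from the text; the function name `subscale_mean` and the choice to summarize a subscale by the mean of its substantive responses are assumptions for illustration, since the source does not state how subscales are scored.

```python
# Substantive response options on the 4-point scale.
SUBSTANTIVE = {
    0: "Not at all",
    1: "A little",
    2: "A moderate amount",
    3: "A lot",
}

# Assessor-only codes; these are not points on the scale.
NON_RESPONSE = {
    7: "Respondent refused to answer",
    8: "Respondent doesn't know",
    9: "Not applicable",
}


def subscale_mean(responses: list):
    """Mean of substantive responses, ignoring codes 7, 8, and 9.

    Returns None when no substantive response is present. Averaging is an
    assumed scoring rule used only for this sketch.
    """
    scored = []
    for code in responses:
        if code in SUBSTANTIVE:
            scored.append(code)
        elif code not in NON_RESPONSE:
            raise ValueError(f"unknown response code: {code!r}")
    return sum(scored) / len(scored) if scored else None
```

Treating 7, 8, and 9 as numbers would inflate any average, which is why the sketch rejects unknown codes outright and filters the assessor-only codes before computing a summary.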
Our third major finding revolved around item selection, or the relevance of items by site. Here we asked respondents whether items were applicable to the specific task-sharing mental health intervention at hand, thus enabling evaluation of which barriers and facilitators showed relevance, as well as which were shared across sites. To illustrate, Table 4 shows the BeFITS-MH items (provider version) across all six core domains (including optional and omnibus items) rated by applicability (i.e., relevance) to the implemented task-sharing program at each of the three sites. Two main findings emerged. First, all the required items (and all the optional items except for two, Items 4.4 and 4.5) were rated as "applicable" by at least one site, thus indicating relevance of the vast majority of the identified constructs to task-sharing. This finding held true even though sites varied in the number of total items rated as "not applicable" (among sites, Chile rated 2 total items, South Africa rated 6 total items, and Nepal rated 11 total items as "not applicable"; of note, no omnibus item was rated as "not applicable" by any site). Second, common relevance of items across sites identified aspects that could be considered core components of barriers and facilitators. This emerged most clearly in Domain 4, "Provider Contextual Congruence", where the task-sharing provider's age (4.1), gender (4.2), being from the same community (4.3) and caste/ethnicity (4.6) were rated as relevant by all sites; conversely, optional items of provider's social status (4.4) and religion (4.5) were rated as not relevant by all sites. These two items were rated as "not applicable" due to the perceived social inappropriateness of commenting upon some personal characteristics of the task-sharing provider, as expressed by respondents in South Africa and Chile. In Nepal, given the overlap of identity markers in this context (e.g., social status [4.4], religion [4.5], and caste/ethnicity [4.6]), only the item assessing provider's caste/ethnicity [4.6] was retained. While items 4.4 and 4.5 were judged as not applicable to our three sites, we retained these items for testing in future locales. Similarly, in Domain 5, "Provider Accessibility and Availability", the ease of talking to (5.1), availability of (5.2), and ease of contacting (5.3) the task-sharing provider were rated as relevant by all sites; conversely, optional items of regularly attending (5.4) and being on time for (5.5) the task-sharing service were rated as "not applicable" by one or more sites. These items were rated as not relevant because there were not different times for task-sharing and "standard" clinical services (i.e., the two were fully integrated) per the task-sharing programs delivered in Nepal and South Africa.
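The cross-site triangulation described above amounts to a simple classification of each item by how many sites rated it applicable. The sketch below encodes only the Domain 4 and Domain 5 ratings as they can be read from the prose (it is not a transcription of Table 4), and it assumes that items 5.4 and 5.5 were rated applicable in Chile, which follows from the statement that Chile rated only two items as "not applicable". The names `RATINGS` and `classify` and the category labels are hypothetical.

```python
SITES = ("South Africa", "Chile", "Nepal")


def _all_sites(value: bool) -> dict:
    """Helper: the same applicability rating at every site."""
    return {site: value for site in SITES}


# True = rated "applicable"; False = rated "not applicable".
# Ratings are read from the prose for Domains 4 and 5 only.
RATINGS = {
    "4.1": _all_sites(True),
    "4.2": _all_sites(True),
    "4.3": _all_sites(True),
    "4.4": _all_sites(False),
    "4.5": _all_sites(False),
    "4.6": _all_sites(True),
    "5.1": _all_sites(True),
    "5.2": _all_sites(True),
    "5.3": _all_sites(True),
    # Assumption: applicable in Chile only (see lead-in).
    "5.4": {"South Africa": False, "Chile": True, "Nepal": False},
    "5.5": {"South Africa": False, "Chile": True, "Nepal": False},
}


def classify(item_ratings: dict) -> str:
    """Label an item by the number of sites that rated it applicable."""
    applicable = sum(1 for site in SITES if item_ratings[site])
    if applicable == len(SITES):
        return "core"
    if applicable == 0:
        return "not applicable at any site"
    return "site-specific"


# Items rated applicable everywhere are candidates for the shared core.
core_items = sorted(item for item, r in RATINGS.items() if classify(r) == "core")
```

Keeping the "not applicable at any site" items in the table, rather than deleting them, matches the decision to retain items 4.4 and 4.5 for testing in future locales.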

Identification and Assessment of Key Implementation Outcomes to Enhance Utility
In the pilot testing of the three standard IS outcome measures (AIM, IAM, FIM) (45), the provider versions were deemed translatable and comprehensible by the task-sharing providers. However, in the Nepal and South Africa contexts, within each scale many of the items had the same translation in the local languages. For example, for the Intervention Appropriateness Measure (IAM), items of "seems fitting", "seems suitable", and "seems like a good match" all had the same terminology in Nepali; similarly, within the Feasibility of Intervention Measure (FIM), the items of "seems implementable", "seems possible", and "seems doable" also were all translated with the exact or very similar wording. Across all sites, the versions of the three IS measures that we adapted for client respondents were deemed repetitive and difficult for service users to respond to. Because service users generally do not have experience with other mental health services, they were unable to compare and contrast their current service or provider with other experiences, and thus many reported that they did not understand how to respond. When asked to compare the different measures, both providers and clients found the tailored nature of the BeFITS-MH items easier to understand and respond to.
From the site-specific stakeholder discussions to identify indicators and assessment methods for the implementation outcomes (of acceptability, appropriateness, and feasibility) to enhance utility, four common categories of indicators emerged across all three outcomes: (I) uptake/adoption of the task-sharing mental health intervention by client, provider, and facility; (II) effectiveness/impact of the program in terms of the client health outcomes and the capability of providers to deliver more effective and relevant services; (III) ability to design and implement the program with oftentimes limited clinical resources; and (IV) stigma-related issues (see Table 6).
The first set of indicators and assessment methods, evaluation of uptake/adoption at the client level, included indicators such as numbers of referrals, successful initiation of services, completed sessions, and follow up visits. Some stakeholders suggested that uptake/adoption at the provider level could be assessed by measuring factors such as provider's willingness to use and follow the task-sharing program, or whether services were provided as intended. At the facility/organization level, an understanding of whether and to what extent the task-sharing program has been implemented (e.g., fidelity) or program components integrated into the organization would indicate program adoption. Notably, these uptake/adoption indicators were listed as ways to assess all three outcomes of acceptability, appropriateness, and feasibility, and by stakeholders across all three study sites.
The second set of indicators and assessment methods revolved around the effectiveness and impact of the task-sharing intervention. Most stakeholders mentioned indicators of program effectiveness for the clients, which included measurements such as improvements in client outcomes (e.g., symptom scores per standardized measures), or users' and providers' perceived 'usefulness' or 'helpfulness' of the specific task-sharing program in addressing clients' health outcomes and other needs. Some stakeholders also noted assessing implementation outcomes in terms of the impact of the task-sharing intervention on providers' professional development (e.g., the value providers assign to the program as an opportunity to grow professionally and expand their skillsets by providing effective services).
Issues related to resource constraints were identified as the third set of indicators and assessment methods, although these factors were most frequently mentioned with regard to feasibility. Stakeholders from all three sites highlighted clinical resource considerations such as having adequate measures, sufficient personnel and space, and resources to address patient needs in the context of frequently restricted resources. However, we were unable to ascertain specific indicators related to the task-sharing program's ability to balance its activities with existing resource constraints.
Finally, stigma-related factors were identified as influential to all three implementation outcomes and to clients, providers, and health systems levels. Stigma was identified in relation to issues of confidentiality (e.g., whether facilities had space for confidential information sharing between clients and health providers; and that designated rooms [e.g., for mental health counseling] did not compromise client confidentiality by inadvertently identifying individuals as having a mental health condition). Related to task-sharing specifically, stakeholders in South Africa noted that they preferred to see task-sharing providers who were referred to generally as "counselors" rather than "mental health counselors", and that peer providers in particular (i.e., persons with the illness [HIV] status themselves who are modeling recovery) were better suited to help patients overcome internalized stigma and effectively address their mental health problems.
Several limitations require noting in this case study to develop the BeFITS-MH measure. The tension between developing a measure that can be 'universal' and one that retains 'location specific' properties was present throughout the process and may prove illustrative for other study teams developing IS measures for global use. This tension was first exemplified in the discussion around item format (question vs. statement). Harmonizing across languages and program types resulted in creation of spaces in each question for projects to enter their own program-specific terminology for the intervention and provider type. On the one hand, this allowed needed project specificity in adapting the measure to fit how programs and providers are defined locally; on the other hand, this may also complicate comparisons between sites where programs and providers differ. In terms of conducting the BeFITS-MH measure piloting and stakeholder discussion fieldwork, the COVID-19 pandemic limited the number of assessments that could be completed and in particular limited stakeholder engagement with service users themselves during early stages of the development process. Nevertheless, each study site was able to obtain provider and systems-level stakeholder feedback in the translation and adaptation stages and when obtaining pilot data from providers and service users.
The process presented in this case study was done in part to prepare for the larger BeFITS-MH validation study, in which the BeFITS-MH measure is being embedded in each of the three study sites' longitudinal data collection procedures. A persistent challenge during the development and piloting process was the identification of appropriate indicators for validity testing. The piloting results raise the challenge of using measures such as the AIM, IAM, and FIM (45). Given that these IS measures had limited comprehensibility and items were often interpreted as redundant, these measures were determined to be not optimal as measures for future construct validity testing. Further, the stakeholder discussions raised challenges in identifying the type and extent of administrative data available within the studies (i.e., uptake data) to operationalize implementation outcomes, and instead emphasized the importance of incorporating preferred indicators that are of clearer utility to program implementers (e.g., uptake/adoption; effectiveness of a task-sharing intervention). Discussions with site co-investigators are ongoing to identify available and appropriate indicators of implementation outcomes to support testing the future predictive validity of the BeFITS-MH measure, and whether other appropriate IS measures exist that could suit our purpose.

Conclusion
A key goal of this case study was to describe the process of developing an IS measure that can be pragmatically useful across multiple diverse global settings with a range of different task-sharing mental health interventions. The challenges that we faced (e.g., identifying accurate terminology for key concepts in each locale, harmonizing translation across sites, identifying appropriate implementation outcomes and indicators for these for validity testing), and the rigorous strategies that we employed to address these, can serve as a rich case description for other implementation research projects. We believe this case study provides a roadmap for other research teams seeking to develop IS measures and locate appropriate measures by which to conduct validity testing, and to those who wish to maximize the local relevance, utility, and impact of their measures while ensuring global applicability.
The development of the BeFITS-MH measure is based on the need to improve identification of actionable factors that may enhance or impede uptake of mental health services delivered using task-sharing strategies. Identifying such factors will lead to more appropriately targeted systems-level interventions across settings and task-sharing programs limits the researchers' and implementers' ability in understanding and addressing critical factors of implementation success.
Case Study: Process of Developing the BeFITS-MH Measure for Task-Sharing in Mental Health
This case study describes the collaborative process of: a) developing and; b) enhancing the utility of the Barriers and Facilitators in Implementation of Task-Sharing in Mental Health (BeFITS-MH) measure. The BeFITS-MH measure is intended to be a pragmatic, multi-dimensional, multi-stakeholder measure to help program implementers and researchers assess critical, modifiable (i.e., actionable) implementation factors (i.e., 'barriers and facilitators') that affect the acceptability, appropriateness, and feasibility of evidence-based task-sharing mental health interventions. This case study presents the process of developing and piloting the BeFITS-MH measure to aid teaching, practice and research by IS researchers and program implementers. The BeFITS-MH measure is being embedded for validation in task-sharing mental health studies in three global settings: (I) an integrated mental health care package for chronic disorders, including HIV, in South Africa (Southern African Research Consortium for Mental health INTegration [SMhINT]) (31); (II) a team-based, multicomponent approach for first episode psychosis in Chile (OnTrack Chile [OTCH])
Abbreviations

Table 1
Summary of the task-sharing mental health interventions of validation sites for the BeFITS-MH measure
This case study begins with our comprehensive process to create a new measure, informed by both IS frameworks and empirical work, to operationalize relevant domains of barriers and facilitators in implementing task-sharing mental health interventions. Rather than referring to a case study research design (34), we use the term "case study" here to refer to a rich narrative description (akin to "case reports" or "case examples" as used in other fields) to provide a real-life example of how to evaluate implementation processes and outcomes in global contexts. Next, we describe the collaborative linguistic,
* Integrated mental health package created based on the Reach, Effectiveness, Adoption, Implementation, Maintenance (RE-AIM) framework and the Consolidated Framework for Implementation Research (CFIR). The package includes mental health literacy of users; training implementation and uptake of mental health screening tool and assessment by primary health care nurses; training lay counsellors in depression counselling; and training and implementation of a community education and detection tool by community health workers at household level.
a NYC = New York City; b MPhil = Masters of Philosophy.
These measures have been used in English-speaking populations across a range of interventions, including with school staff for student-wellness programming in England and health care providers providing antenatal alcohol reduction interventions in Australia (46, 47). More recently, these measures have been used in LMIC settings (Kenya, Tanzania, Botswana, South Africa, and Guatemala) in studies of mental health interventions (depression, anxiety, and alcohol use disorder) including those utilizing task-sharing strategies, HIV services, and medical interventions for genetic disorders and malignant cancers (48-53). Of note, English language versions of these measures have been used in most settings, with a Swahili version developed in Kenya through translation-back-translation methods

Table 2
Stakeholder Feedback Definitions and Probes
Implementation Outcome: Definitions of Implementation Outcomes
Acceptability: This is the view that the program is agreeable and satisfactory to the people providing the program and to the people receiving the program. This means the program is a good fit for the individuals providing the program and for the people receiving the program.
Appropriateness: This is the view that the program fits and is relevant to the setting, to the people providing the program and to the people receiving the program. This means the program is suitable, compatible, a good fit for the health issue, and/or given the norms and beliefs of the clinic, the people giving the program, or the people receiving the program.
Feasibility: This is the view that the program can be successfully used and carried out by the providers in a given setting. This means the program is possible to do given the resources (such as time, effort, and money), and the circumstances (such as policies, timing).

Table 3
, allowing for better localization of the task-sharing mental health intervention. The purpose of this survey is to ask you some questions about your experience participating in [PROGRAM], which involves [TYPE OF SERVICE] delivered by [PROVIDER TYPE] to help with [TARGET PROBLEM].

Table 3a
Localization of key Task Sharing for Mental Health Intervention terms
Item #1: The purpose of this survey is to ask you some questions about your experience participating in [program], which involves [type of services] delivered by [provider type] to help with [target problem].
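For teams administering the measure electronically, the localization of Item #1 can be sketched as a simple placeholder substitution. The sketch below is illustrative only and is not part of the BeFITS-MH measure or its study procedures: the function name `localize_item` and the example terms are assumptions, and the only text taken from the source is the Item #1 wording with its four bracketed placeholders.

```python
import re

# Item #1 wording as given in Table 3a, with its bracketed placeholders.
ITEM_1 = (
    "The purpose of this survey is to ask you some questions about your "
    "experience participating in [program], which involves [type of services] "
    "delivered by [provider type] to help with [target problem]."
)

# Matches one bracketed placeholder such as "[provider type]".
PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def localize_item(template, terms):
    """Fill each [placeholder] in `template` with a site's own term.

    `terms` maps placeholder names (case-insensitive) to local wording.
    A KeyError is raised if any placeholder has no local term, so an
    unlocalized item is never shown to a respondent.
    """
    normalized = {name.lower(): value for name, value in terms.items()}

    def substitute(match):
        name = match.group(1)
        if name.lower() not in normalized:
            raise KeyError(f"no local term supplied for [{name}]")
        return normalized[name.lower()]

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical terms for illustration only; each site would supply its own.
example_terms = {
    "program": "the clinic counselling program",
    "type of services": "depression counselling",
    "provider type": "lay counsellors",
    "target problem": "depression",
}

localized = localize_item(ITEM_1, example_terms)
print(localized)
```

Raising an error on a missing term, rather than leaving the bracket in place, is a design choice intended to keep partially localized items out of the field; a site could equally review the filled-in wording by hand.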

Table 4
Standard and Site-specific BeFITS-MH Provider Items
optional items can be used by implementers if determined to be appropriate for the local context. A summary of the BeFITS-MH domains and examples of each domain's omnibus question is presented in Table 5. (The full BeFITS-MH measure is included in Additional File 1).

Table 5
Note. Example questions are from the provider version of BeFITS-MH. *Indicates omnibus questions.

Table 6
Example Indicators of Implementation Outcomes Stratified by Main Themes and Level of Measurement
provide an illustrative example for researchers and program implementers to identify and address barriers to the initiation, implementation, and sustainability of task-sharing mental health programs across three global contexts. This development process, where we employed a collaborative, multi-country, multi-stakeholder approach, can serve as a valuable case example for other teams developing IS measures, and provide support for considering content validity, contextual relevance (i.e., linguistic, cultural, and contextual adaptation), and pragmatic utility as key factors in the process of developing and enhancing validity for IS measures. In particular, we believe the concurrent adaptation and piloting across programs and global sites with multiple stakeholders from each site contributed novel strategies to standard measurement approaches. A core lesson that emerged is that targeting implementation measures towards actionable domains that could predict pragmatic markers of utility (e.g., effectiveness of an intervention) per program implementers' preferences may generate implementation measures with greater content validity, relevance, and utility. The development of the BeFITS-MH measure was guided by IS frameworks, including the CFIR and TDF, to capture generalizable IS constructs, and developed to be sufficiently targeted and brief to support pragmatic and sustainable use in task-sharing programs to support adaptation and quality improvement. Rigorous content validity was established through elucidation of barriers and facilitators to task-sharing mental health concepts using qualitative data from: (I) task-sharing mental health interventions previously conducted in three global sites; and (II) a systematic review, followed by review by an expert Delphi panel. The final BeFITS-MH domains each include 3-4 individual items and a single omnibus question; once validated, the measure could be as brief as 7 items if only the omnibus questions are used.
from all sites simultaneously. Of note, all the required items (and nearly all the optional items) showed relevance to at least one site, indicating that our six identified BeFITS-MH domains were useful in assessing barriers and facilitators to task-sharing overall. Further, the BeFITS-MH domain of "Provider Contextual Congruence" showed congruence across all three sites where certain aspects of the task-sharing provider's personal characteristics (i.e., age, gender, being from the same community, and ethnicity) were viewed as relevant, but other aspects such as the provider's social status and religion were considered optional. These results illustrate that while each of our six identified BeFITS-MH domains appears universally relevant to task-sharing mental health interventions, the items that make up specific BeFITS-MH domains can vary by cultural context, what is being done in the task-sharing mental health intervention, and by the nature of the help provided. Finalizing item content while accounting for contextual variations in constructs (and translations) across the three contexts strengthened the measure's potential global relevance and validity. In order to evaluate the ability of the BeFITS-MH measure to accurately assess the key implementation outcomes of acceptability, appropriateness and feasibility during future validity testing, we piloted three frequently used implementation outcome measures (AIM, IAM, and FIM) and identified the provider versions as useful, albeit with considerable limitations because of the idiomatic and redundant nature of the terminology when translated into local languages. The client versions were deemed neither comprehensible nor applicable. Because these three measures were designed initially for program implementers and higher-level systems administrators, the limited experience with mental health services of any kind among populations in LMICs, and their lack of familiarity with what alternative mental health services 'should' or 'could' look like, limited the validity and utility of these measures for the client level of measurement. Moreover, most prior studies with these measures in LMICs have been limited to English language versions of the measures and usage only among providers.
In the process of enhancing the future utility of the BeFITS-MH measure, the pilot testing and stakeholder discussions illustrated the perceived overlapping nature of the IS outcomes of acceptability, appropriateness, and feasibility. The stakeholders in particular provided feedback that many indicators and assessments can fit across multiple implementation outcomes. For example, indicators of uptake/adoption and effectiveness/impact fall across all three constructs of acceptability, appropriateness, and feasibility. These results suggest that what stakeholders value in terms of signaling useful implementation outcomes may not fit the traditional academic approach of treating these implementation outcomes as discrete, thus indicating a potential limitation of relying solely on these IS outcome measures. Instead, conducting concurrent pilot testing and stakeholder analyses, as we have done in this study, may result in a validation process and measure that have greater content validity, meaning, and usefulness for program implementers.