In search of environmental risk factors for obsessive-compulsive disorder: Study protocol for the OCDTWIN project

Background The causes of obsessive-compulsive disorder (OCD) remain unknown. Gene-searching efforts are well underway, but the identification of environmental risk factors is at least as important and should be a priority because some of them may be amenable to prevention or early intervention strategies. Genetically informative studies, particularly those employing the discordant monozygotic (MZ) twin design, are ideally suited to study environmental risk factors. This protocol paper describes the study rationale, aims, and methods of OCDTWIN, an open cohort of MZ twin pairs who are discordant for the diagnosis of OCD. Methods OCDTWIN has two broad aims. In Aim 1, we are recruiting MZ twin pairs from across Sweden, conducting thorough clinical assessments, and building a biobank of biological specimens, including blood, saliva, urine, stool, hair, nails, and multimodal brain imaging. A wealth of early life exposures (e.g., perinatal variables, health-related information, psychosocial stressors) are available through linkage with the nationwide registers and the Swedish Twin Registry. Blood spots stored in the Swedish phenylketonuria (PKU) biobank will be available to extract DNA, proteins, and metabolites, providing an invaluable source of biomaterial taken at birth. In Aim 2, we will perform within-pair comparisons of discordant MZ twins, which will allow us to isolate unique environmental risk factors that are in the causal pathway to OCD, while strictly controlling for genetic and early shared environmental influences. To date (May 2023), 43 pairs of twins (21 discordant for OCD) have been recruited. Discussion OCDTWIN hopes to generate unique insights into environmental risk factors that are in the causal pathway to OCD, some of which have the potential of being actionable targets.


Aim 1. Recruitment of MZ twins discordant for OCD and creation of a biobank
Participants and recruitment sources
OCDTWIN aims to recruit a minimum of 50 MZ twin pairs (n = 100 unique individuals) discordant for OCD aged 16 years and older (see Power considerations section below). Control twin pairs without a diagnosis of OCD are available from the ongoing Roots of Autism and Attention-Deficit/Hyperactivity Disorder (ADHD) Twin Study in Sweden (RATSS) project [10], which uses largely identical procedures. Finally, we will also recruit MZ twin pairs concordant for OCD diagnosis. By recruiting both discordant and concordant twin pairs, we will be able to appropriately represent the source population with regard to exposure variables and covariate distributions, and account for this in statistical analyses, both in within-pair analyses and standard (between-individual) analyses (see General statistical framework for details).
The twins are recruited from different sources. The main source of recruitment is the Swedish Twin Registry, the largest and most comprehensive twin register in the world [7,11]. All the cohorts listed in Table 1 contain validated measures of obsessive-compulsive symptoms, allowing identification of MZ pairs potentially concordant or discordant for these symptoms. Furthermore, the Swedish Twin Registry has been linked with the National Patient Register [12], allowing for the identification of twins who have been diagnosed with OCD in specialist services across Sweden. Crucially, participants in the Swedish Twin Registry have provided informed consent to be contacted for research purposes. Other sources of recruitment are the Swedish OCD interest organization (Svenska OCD-förbundet) and media advertisements.

Procedures
Potential participants identified via any of the sources described above receive a study invitation letter including information about the OCDTWIN project via regular mail. Interested individuals contact the research team. This is followed by a screening phone call to assess eligibility for participation.
Inclusion criteria are: the twins are MZ, at least one member of the twin pair has a lifetime diagnosis of OCD, and both twins consent to participate, are literate in Swedish, and are willing to travel to Stockholm for the assessment. Twins who want to participate in the study but do not wish to travel can still participate in the clinical assessments via telephone and send a subset of biological samples in the post. Information on zygosity is available from the Swedish Twin Registry, but is confirmed by genotyping of saliva or whole-blood derived DNA using a whole-genome covering SNP array [13].
Exclusion criteria include: organic brain disorder, brain injury, epilepsy or an acute mental disorder that may interfere with the evaluation (e.g., psychosis, bipolar disorder). Additional exclusion criteria for the Magnetic Resonance Imaging (MRI) scans include: previous brain surgery, metal implants or medical devices containing metal (e.g., pacemaker), claustrophobia, pregnancy, morbid obesity, or large tattoos.
Twins not currently eligible for the MRI scan can still participate in the remaining assessments and can have the scan at a later stage (e.g., in the case of pregnancy).
Twin pairs meeting inclusion criteria are mailed a questionnaire package and invited for a full testing day in Stockholm. Data acquisition consists of a full day of evaluations, including a detailed clinical interview, neurocognitive testing, physical examination, collection of biological samples, and an MRI scan. Table 2 summarizes the collected information and the instruments used. All biological specimens are deposited at the Karolinska Institutet biobank, according to standard protocols.

Genetically-informative studies, in particular those employing the discordant MZ twin design, are ideally suited to test whether the association between an environmental measure (e.g., medical complications at birth) and an observed phenotype (e.g., OCD) is likely to be consistent with a causal effect because they provide excellent control of many potential confounders, including genetic factors and shared environmental influences. Because MZ twins are genetically identical, and grow up largely in the same environment, any observed phenotypic differences between members of an MZ twin pair (e.g., one twin is affected and the other is not) may be attributable to non-shared environment. In contrast to studies comparing a sample of cases vs. controls (classic comparison, Fig. 1), or even relatives of cases, MZ twins discordant for OCD provide a unique opportunity to isolate environmental risk factors that are unique to each individual, while controlling for a myriad of measured and unmeasured confounders, such as genetic factors, sex, age, parental effects, as well as shared in utero and early life environmental effects (comparison 1, Fig. 1).

General statistical framework
Within-pair differences between affected MZ twins with OCD and their co-twins will be analyzed in a generalized estimating equation (GEE) framework, accounting for dependencies between twins in pairs using cluster-robust standard errors. In what is commonly referred to as the co-twin control design, we will examine within-pair associations by analyzing data conditioned on pairs (fixed-effects regression) [14][15][16]. Results from these analyses are automatically adjusted for any confounding factors that are shared between twins in a pair [17], particularly genetic factors, since MZ twins are genetically identical (comparison 1, Fig. 1). Even though our main interest is to identify unique environmental effects, we will also compute a standard association (that is, individuals with OCD vs. controls, regardless of co-twin OCD status) by re-weighting the data by sampling probability and will thus recover the association in the source population, making it possible to identify effects potentially attributable, at least in part, to familial vulnerability (comparison 2, Fig. 1). In addition, all twin pairs, regardless of whether they were recruited as concordant or discordant, will contribute to analyses of within-pair associations for variables of interest other than OCD diagnosis on which they may be discordant, such as scores on OCD severity scales.
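To make the logic of the co-twin control comparison concrete, the following is a minimal sketch, not the project's analysis code. It assumes a continuous outcome and a single continuous exposure, and it computes only the point estimate of a within-pair (fixed-effects) slope; the GEE machinery and cluster-robust standard errors described above are not reproduced. The function names, the simulated data, and all parameter values are hypothetical.

```python
import random


def within_pair_slope(pairs):
    """Fixed-effects (within-pair) slope of outcome on exposure.

    pairs: list of ((x1, y1), (x2, y2)), one tuple per twin pair.
    Differencing within a pair cancels anything both twins share.
    """
    num = 0.0
    den = 0.0
    for (x1, y1), (x2, y2) in pairs:
        dx, dy = x1 - x2, y1 - y2
        num += dx * dy
        den += dx * dx
    if den == 0.0:
        raise ValueError("no within-pair variation in the exposure")
    return num / den


def pooled_slope(pairs):
    """Ordinary least-squares slope that ignores the pairing."""
    points = [twin for pair in pairs for twin in pair]
    mx = sum(x for x, _ in points) / len(points)
    my = sum(y for _, y in points) / len(points)
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den


def simulate_pairs(n_pairs, beta, seed=0):
    """Hypothetical data: a shared familial factor confounds x and y."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        shared = rng.gauss(0.0, 1.0)  # identical for both twins in a pair
        twins = []
        for _ in range(2):
            x = shared + rng.gauss(0.0, 1.0)  # exposure, partly familial
            y = beta * x + 2.0 * shared + rng.gauss(0.0, 0.1)
            twins.append((x, y))
        pairs.append((twins[0], twins[1]))
    return pairs


pairs = simulate_pairs(n_pairs=500, beta=0.5)
within_estimate = within_pair_slope(pairs)
pooled_estimate = pooled_slope(pairs)
print(f"within-pair: {within_estimate:.2f}, pooled: {pooled_estimate:.2f}")
```

In this simulation the within-pair estimate should land close to the true exposure effect (0.5), whereas the pooled estimate is inflated by the shared familial factor, which is the kind of confounding the discordant-twin comparison is designed to remove.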

Register-based data
The Swedish national registers contain administrative records from the entire population prospectively collected over several decades [6]. Data from different registers can be linked by using the personal identification number assigned to all Swedish residents at birth or immigration [18]. We will have access to a wide range of early life exposures, such as perinatal and early-life health-related variables, that may have resulted in differentially exposed twins. For twins recruited via the Swedish Twin Registry [7,8], a wealth of prospectively collected data (parent and twin-reported) on environmental exposures are available for analysis. The Child and Adolescent Twin Study in Sweden (CATSS) cohort of the Swedish Twin Registry, where most participants are recruited from, has been described in detail elsewhere [19]. Importantly, the information from the Swedish Twin Registry can be linked to the above-mentioned national registers. For a list of linked registers and examples of available variables, see Table 3. Because we carefully record the date of OCD symptom onset and diagnosis of the affected twins, we will be able to identify exposures that preceded symptom onset.

Information on all inpatient and outpatient contacts at all hospitals and specialist centers. Includes primary and supplementary diagnoses based on the International Classification of Diseases codes, including somatic (e.g., autoimmune diseases, infections, allergies, respiratory diseases) and psychiatric disorders (e.g., trauma and stress related disorders, mood disorders, substance use disorders).

Prescription Drug Register [71]
Individual-level data for all prescriptions dispensed for in- and outpatients, including type of medication registered using Anatomical Therapeutic Chemical (ATC) Classification System codes, dosage, prescription date, prescriber, pharmacy, etc.

Swedish Twin Registry
Variables collected via questionnaires in different waves, including somatic and mental health, personality development, vaccinations, substance use, physical activity or psychosocial adaptation and environment (e.g., traumatic events, school problems, friendships, bullying victimization/perpetration).

Epigenetics - methylation analyses
Current neurobiological models of OCD implicate epigenetic mechanisms in the etiology of OCD [20]. However, the literature is limited. Our design is ideally suited for the identification of epigenetic changes, as the genomes of MZ twins are identical, potentially allowing for the observation of changes in the epigenome in the absence of genetic variation between twins. DNA methylation analysis, which has been previously studied in neurodevelopmental disorders [21] and in OCD [22], will be used to determine differential methylation in the affected twin, compared to the unaffected co-twin. Genome-wide methylation analysis will be used first, given the limited evidence of methylation changes in OCD. Second, we will follow the general approach of a previous epigenetic study in OCD [22], which specifically examined DNA methylation profiles of selected loci that had been associated with OCD in previous genome-wide association studies (GWAS) [23][24][25]. However, previous GWAS of OCD were severely underpowered. Our proposed analyses are timely as the largest GWAS conducted to date, including approximately 45,000 cases and 30 genome-wide-significant loci, is nearing completion. We hypothesize that affected twins will exhibit differential methylation at genes identified by this GWAS, compared to their unaffected co-twins. Analyses of genome-wide methylation and methylation profiles of selected genes will be performed using array-based specific DNA methylation analysis. This array targets > 935,000 CpG sites at single nucleotide resolution, including 99% of RefSeq genes and 96% of CpG islands. Candidate differentially methylated regions will be confirmed by pyrosequencing or nanopore sequencing. As blood-derived DNA is a mixture of blood-cell-type-specific methylation patterns, we will collect information about cell counts and correct bioinformatically for any putative differences due to cell populations [26].
As it is a priority of OCDTWIN to identify the epigenetic effects of unique environment while controlling for genetic effects, several additional genetic mechanisms will be studied in order to confirm identical genomes in affected and unaffected twins. These include chromosomal mosaicism, post-zygotic mutation, and mutations of mitochondrial DNA [27]. To assess the landscape of both somatic and germline genetic variants among the twins, we will use whole genome sequencing (WGS) [28] and high-density DNA microarrays. DNA microarrays can be used for detection of large CNVs using multiple analysis programs, and the variations found in samples will be compared to control twins, other available controls, and databases to identify the frequency and functionality of the variants identified. Furthermore, polygenic risk scores can be calculated and incorporated into all analyses within the OCDTWIN project. WGS can identify rare post-zygotic somatic mutations in the twins. Additionally, rare, damaging variants will be investigated as putative liability variants. Identified somatic and selected damaging germline variants will be subject to technological validation by Sanger sequencing or digital droplet PCR.

Neonatal blood spots
The Swedish PKU biobank [9] contains neonatal blood spots from all children born in Sweden since 1975. Participating twins consent to the use of these blood spots to extract DNA, proteins, and metabolites, providing an invaluable source of biomaterial taken at birth. In other disorders, important discoveries have been made using neonatal blood spots. For example, persons who develop psychosis have lower levels of certain acute phase proteins (APPs) at the time of birth [29]. APPs are central to innate immune function as well as central nervous system development. Prior studies [31] have demonstrated a high genotyping call rate using whole genome amplified DNA from Swedish blood spots collected from 1975 to 2002. Two 3 mm punches from the blood spots are incubated in 200 µl 1x phosphate buffered saline for 2 hours at room temperature on a rotary shaker (900 rpm), yielding an eluate of proteins such as acute phase proteins and antibodies as well as other metabolites (e.g., vitamin D, cytokines, etc.). DNA is then extracted (~ 40-150 ng), only a portion of which (10 ng) is whole genome amplified (Repli-g screening kit, Qiagen). The unamplified DNA retains methylation marks and can be used for epigenetic profiling and/or CNV validation. The amplified DNA can be used for array genotyping, exome sequencing or whole genome sequencing. These analyses will be conducted in collaboration with colleagues at the Statens Serum Institut in Copenhagen, Denmark.

Immunology/inflammation
Pediatric Autoimmune Neuropsychiatric Disorder Associated with Streptococcal Infection (PANDAS) can be viewed as an example of a gene-environment interaction leading to OCD [32]. In PANDAS, a relatively common infection appears to represent an environmental stressor that can trigger OCD in a few genetically vulnerable cases. In support of this idea, we have recently reported that while in utero and early life infections are associated with a subsequent risk of OCD, the associations attenuated to the null in sibling models [33]. This suggests that familial or genetic factors explain the association between these early-life infections and OCD. In other words, infections may only trigger obsessive-compulsive symptoms in genetically vulnerable individuals. Through register linkage, we will be able to test whether affected twins are more likely to have had documented infections in early childhood, compared to their unaffected co-twins, in OCD-discordant pairs. In addition, the following markers will be tested in blood: complete blood count (CBC), erythrocyte sedimentation rate (ESR), CRP, TSH, T4, anti-TPO, ferritin, autoantibodies (e.g., transglutaminase-Abs, ANA, Histone-Abs), creatinine, cystatin-C, ALAT, protein fractions, complements, IL-1-β, IL-6, IL-8, IL-10, and TNF-α. In line with our statistical approach, differences between affected and unaffected members of a twin pair will be attributable to disease-state (e.g., response to a chronic illness), whereas differences between affected pairs and healthy control pairs may be interpreted as being potentially attributable to trait immunological or vulnerability factors.

Urinary metabolics and gut microbiota
By comparing urinary metabolics and gut microbiota within and between twin pairs, we aim to explore an additional etiological pathway that has been recently suggested [34]. Using urinary samples, metabolic phenotyping will involve high-resolution proton nuclear magnetic resonance (hydrogen-1 nuclear magnetic resonance; 1H NMR) spectroscopy coupled with mathematical modeling approaches to identify metabolic variation associated with OCD discordance in urine and plasma. Metabolic profiles are measured on a 600 MHz 1H NMR spectrometer using standard one-dimensional NMR experiments optimized for quality, sensitivity, and solvent suppression. Liquid chromatography-mass spectrometry (LCMS) may be applied to extend the metabolic characterization of this sample set. LCMS is a complementary technique to 1H NMR spectroscopy with greater sensitivity and wider metabolome coverage. Using fecal samples, we will investigate the gut microbiota, which has emerged as an important functional node within the gut-brain axis [35]. There is increasing interest in the relative potential of the gut microbiota and allied gastrointestinal systems to modulate behavioral functions implicated in psychiatric disorders, including OCD [34]. The determination of gut microbiota will be based on the quantification of evolutionarily conserved DNA sequences [36]. In microbes, ribosomal RNA (rRNA) genes are transcribed from the ribosomal operon as 30S rRNA precursor molecules and then cleaved by RNaseIII into 16S, 23S, and 5S rRNA molecules. Because 16S rRNA is the most conserved of these three rRNAs, it is often referred to as the "evolutionary clock" and, following amplification into 16S rDNA, is highly suitable for the identification and classification of the entire microbial community present in an environmental entity, such as the gut. The total microbial population in human fecal samples will be determined using two state-of-the-art methods, namely 16S rDNA pyrosequencing and 16S rDNA sequencing.

Brain
Individuals with OCD display subtle difficulties in neuropsychological tasks of motor and cognitive inhibition, performance monitoring, cognitive flexibility, and emotional processing [37]. Consistently, structural and functional neuroimaging studies have found involvement of specific fronto-striato-thalamic and parietal systems in OCD [37], although causal relationships cannot be established. It is unclear whether differences between OCD cases and controls represent pre-existing vulnerabilities that precede the onset of the disorder or are environmentally or behaviorally mediated. In support of the former view, a number of studies have found that individuals with OCD and their unaffected first-degree relatives share similar cognitive and neural features [e.g., 38,39,40]. However, as siblings only share about 50% of their genes, it is still unclear whether these findings reflect genetic vulnerability or environmentally-mediated risk factors. In support of the latter view, variables such as living with a chronic illness are suspected to induce neuroplastic changes in the brain of individuals with OCD [41][42][43][44].
Similarly, medication may represent another unique environmental event affecting brain structure in OCD, as indicated by recent mega-analyses [45,46]. The discordant MZ design is ideally suited to understand what brain findings may be secondary to environmental exposures, such as use of medication.
MRI data are acquired on a 3T General Electric 750 MR scanner (equipped with a 32-channel head coil) at the MR Research Center at Karolinska Institutet. T1-weighted images are acquired using a high-resolution BRAVO 3D sequence, using the following parameters: TR/TE = 8.2/3.2; 172 slices; FOV: 240; 256x256; 1x0.94x0.94 mm; flip angle = 12 degrees. Voxel-based morphometry analyses will determine whether gray matter volume differences in cingulate cortex and basal ganglia areas observed in previous meta-analyses [47] can be attributed to unique environmental risk.
Diffusion tensor imaging (DTI) measurements of white matter microstructure are acquired using High AngulaR Diffusion Imaging (HARDI) with 60 directions and 61 slices, Dual spin Echo Epi2ks axial; TR/TE: 8000/99; FOV 96x96; 8 b0 images, b-value: 1000 s/mm². Fractional anisotropy (FA) and white matter volume analyses will help determine whether white matter differences observed in previous meta-analyses [48] and mega-analyses [49] are associated with unique environmental risk factors.
Spatial associations between within-pair differences in whole-brain measures and whole-brain gene expression patterns will be explored. The Allen Human Brain Atlas [54] will be used to test for associations between brain structure and connectivity differences (results from within-pair comparisons) and gene expression in a previously described manner [55][56][57] without requiring information from blood or saliva samples, which can however potentially be integrated in subsequent analyses [57]. The spatial similarity between transcriptional profiles of the entire transcriptome atlas and within-pair differences in brain measures observed in our study population will be quantified. Histogram distributions of spatial similarity values will reveal genes where the gene expression pattern is significantly associated with brain structure and connectivity maps. Moreover, we will particularly focus on neurogenetic processes by investigating specific gene ontology (GO) term analysis for "neuro" annotations, as described previously [55,56]. Finally, we will focus on specific genes strongly suspected to be associated with OCD (such as DLGAP3 [58-61] or NRXN1 [62]) and also new genes uncovered in the latest GWAS.
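The idea of scoring each gene by the spatial similarity between its expression profile and a map of within-pair brain differences can be sketched as follows. This is an illustration only: the protocol does not state which similarity measure is used, so the Pearson correlation across brain regions is an assumption here, and the gene labels, region count, and values are hypothetical toy data rather than Allen Human Brain Atlas output.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


def rank_genes_by_similarity(expression, diff_map):
    """Sort genes by absolute spatial correlation with the difference map.

    expression: dict mapping gene label -> expression value per region.
    diff_map:   within-pair difference per region, in the same region order.
    """
    scores = {gene: pearson(profile, diff_map)
              for gene, profile in expression.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: five brain regions, three placeholder genes.
diff_map = [0.2, -0.1, 0.4, 0.0, -0.3]
expression = {
    "GENE_A": [1.2, 0.6, 1.6, 0.8, 0.2],  # linear in diff_map by construction
    "GENE_B": [0.5, 0.6, 0.4, 0.5, 0.7],
    "GENE_C": [0.3, 0.3, 0.2, 0.9, 0.3],
}

ranked = rank_genes_by_similarity(expression, diff_map)
for gene, r in ranked:
    print(f"{gene}: r = {r:+.2f}")
```

In a real analysis the significance of each similarity value would need to be judged against an appropriate null distribution (for example, one that respects spatial autocorrelation), which this sketch does not attempt.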

Power considerations
Given the novelty of the approaches presented, power analyses are necessarily tentative. Although the distributions of the variables will vary, we have performed a power calculation for continuous, normally distributed variables in a co-twin control design using a GEE analytic framework (i.e., fixed-effects linear regression) (see Fig. 2). With 50 discordant MZ pairs, we will have approximately 80% power to [...] [10].

Our main recruitment source, the CATSS cohort within the Swedish Twin Registry [19], is still actively recruiting at a rate of approximately 3,000 new twins per year, providing a sustained source of potential study participants. Data collection will continue for at least the next two years. If we secure additional funds, we aim to continue recruiting participants beyond the planned 50 pairs, thus increasing statistical power. The study is currently limited to Swedish residents and to participants who are 16 years or older. However, we may consider expanding to twin pairs from other countries in the future. There is a risk that some of the younger twins identified as unaffected have not had time to develop OCD by the time of their participation, as the youngest participants may be 16 years old. We plan to follow up the twins in the registers to capture any new diagnoses of OCD after they have been recruited to OCDTWIN. Some of the described methods and analysis plans may be obsolete by the time we are ready for data analysis. We are collecting hair and nails but have no specific plans for analysis at the time of writing. We will closely follow methodological developments in the field.
While the primary aim of OCDTWIN is the identification of environmental risk factors that are in the causal pathway to OCD, we will collect a wide range of exposures from birth (e.g., perinatal complications, birth order, birth weight), childhood (e.g., early infections, bullying and other traumatic experiences), and up to the time of participation in the study (e.g., current medication use). While the interpretation of results regarding early exposures will be relatively straightforward because these exposures will precede the onset of OCD symptoms, the interpretation of results based on more recent exposures will be more challenging. For example, differences between affected and unaffected twins on a given brain measure could be attributable to environmental exposures accumulated during a lifetime, including changes secondary to chronic illness or medication use. Even in this scenario, the results will still be informative because the nature of the design minimizes the influence of genetic and shared environmental factors, and an association could reveal important, potentially actionable mediators. However, the interpretation of the results will differ according to each specific exposure and whether temporal precedence can be clearly established.
There are additional challenges associated with the discordant MZ twin design. Our approach assumes that MZ twins are genetically identical. However, post-zygotic mutations are known to occur and can be specific to one twin in a pair [28], which could explain OCD discordance in some pairs. On the other hand, this will provide a unique opportunity for genetic discovery. Another potential challenge is twin chorionicity, which is often unknown for adult twins. MZ twins can be sub-classified according to whether they shared the same placenta or not. For example, in a schizophrenia study, concordant MZ pairs were estimated to be more likely to have shared a single placenta, whereas discordant MZ pairs appeared more likely to have separate placentas [68]. Whether and how post-zygotic mutations and chorionicity can impact the interpretation of our results is unclear but will be considered.
The project is expected to generate many scientific outputs. All resulting papers will be deposited in preprint repositories (e.g., bioRxiv, PsyArXiv) to ensure immediate access to the scientific community. We will publish the results in specialized peer-reviewed journals that allow open access formats. Through partnership with other researchers who are collecting similar twin data in other disorders in Sweden, it may be possible to establish which findings are specific to OCD or shared with other neuropsychiatric conditions. OCDTWIN will collect nearly the same data as the RATSS study [10], which focuses on autism and ADHD. Similarly, the ongoing CREAT (Comprehensive Risk Evaluation for Anorexia nervosa in Twins) study focuses on MZ twins who are discordant for anorexia nervosa [69]. Both these cohorts will provide additional opportunities for collaboration.

Availability of data and materials
Due to European legislation, the data will not be deposited in public repositories. However, the data can be made available to the international research community for formal collaborations upon reasonable request and adequate data transfer agreements that comply with Swedish and European law.

Competing interests
Prof Mataix-Cols receives royalties for contributing articles to UpToDate, Inc, and is part owner of a digital health company called Scandinavian E-Health, AB, outside the submitted work. Dr Lorena Fernández de la Cruz receives royalties for contributing articles to UpToDate, Wolters Kluwer Health and for editorial work for Elsevier, all outside the submitted work. Prof Bölte discloses that he has in the last 3 years acted as an author, consultant or lecturer for Medice and Roche. He receives royalties for textbooks and diagnostic tools from Hogrefe and Liber. Bölte is partner in SB Education/Psychological Consulting AB and NeuroSupportSolutions International AB, all outside the submitted work. All other authors report no potential conflicts of interest.

Authors' contributions
All authors obtained funding and conceived/designed the study. DMC wrote the first draft. All authors critically reviewed the manuscript and approved the final version.

Figure 1
Study aims and design rationale.
Footnote: OCDTWIN primarily aims to recruit a cohort of MZ twins who are discordant for OCD (AIM 1) and identify environmental risk factors that are in the causal pathway to OCD (AIM 2). The results of traditional case-control designs are difficult to interpret because they are unable to effectively control for familial confounders (AIM 2, classic comparison). MZ twins discordant for OCD provide a unique opportunity to isolate environmental risk factors that are unique to each individual, while controlling for measured and unmeasured confounders, including shared genetic factors, and early life environmental effects (AIM 2, comparison 1). Even though our main interest is to identify unique environmental effects, we will also compare affected twin pairs (where at least one twin has OCD) with unaffected twin pairs to identify effects that can be attributed, at least in part, to familial vulnerability (AIM 2, comparison 2).

Figure 2
Number of discordant twin pairs needed for power per different effect sizes. "d" refers to average difference, measured in standard deviations, between affected and unaffected twins, similar to Cohen's d.
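
The relationship between effect size and the number of discordant pairs can be illustrated with a simple approximation. The sketch below is not the authors' power calculation: it uses a normal approximation to a two-sided paired test, and it assumes that d is the mean within-pair difference divided by the standard deviation of the within-pair differences, which may differ from the standardization used for Figure 2. The function names and the example values of d are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def paired_power(d, n_pairs, alpha=0.05):
    """Approximate two-sided power of a paired comparison.

    d:       assumed standardized mean within-pair difference.
    n_pairs: number of discordant twin pairs.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = d * math.sqrt(n_pairs)
    # Probability of exceeding the critical value in either tail.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def pairs_needed(d, power=0.80, alpha=0.05):
    """Smallest number of pairs reaching the target power (approximate)."""
    if d == 0:
        raise ValueError("d must be non-zero")
    n_pairs = 2
    while paired_power(d, n_pairs, alpha) < power:
        n_pairs += 1
    return n_pairs


for d in (0.3, 0.4, 0.5):  # illustrative effect sizes only
    print(f"d = {d}: power with 50 pairs = {paired_power(d, 50):.2f}, "
          f"pairs for 80% power = {pairs_needed(d)}")
```

Because the required number of pairs scales roughly with 1/d² under this approximation, modest changes in the assumed effect size translate into large changes in the sample needed, which is why recruiting beyond the planned 50 pairs would materially increase power for smaller effects.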