Neurodesk: An accessible, flexible, and portable data analysis environment for reproducible neuroimaging

Neuroimaging data analysis often requires purpose-built software, which can be challenging to install and may produce different results across computing environments. Beyond being a roadblock to neuroscientists, these issues of accessibility and portability can hamper the reproducibility of neuroimaging data analysis pipelines. Here, we introduce the Neurodesk platform, which harnesses software containers to support a comprehensive and growing suite of neuroimaging software (https://www.neurodesk.org/). Neurodesk includes a browser-accessible virtual desktop environment and a command line interface, mediating access to containerized neuroimaging software libraries on various computing platforms, including personal and high-performance computers, cloud computing and Jupyter Notebooks. This community-oriented, open-source platform enables a paradigm shift for neuroimaging data analysis, allowing for accessible, flexible, fully reproducible, and portable data analysis pipelines.


Introduction
Neuroimaging data analysis is a challenging enterprise. Aside from the neuroscientific principles motivating the choice of analysis, building an analysis pipeline requires advanced domain knowledge well beyond the researcher's topic area; for example, signal and image processing, computer science, software engineering, statistics, machine learning, and applied physics. Researchers faced with this daunting task rely on multiple specialized software packages used in custom pipelines to suit a specific aim.

Ideally, the software and code used in any scientific analysis workflow should be easily accessible so that users can deploy the workflow without a substantial investment of time or effort 5. It should be portable so that analysis workflows can be tractably shifted between operating system versions and computing environments and deliver identical results. Many researchers prototype analysis pipelines on their local computers and later switch to workstations and high-performance computing (HPC) clusters for processing datasets at scale. Accessible and portable workflows allow for an optimized allocation of computing resources while supporting shared development workloads amongst collaborators 6. Unfortunately, many neuroimaging data analysis workflows are currently neither readily accessible nor portable 7-9 because they rely on specialized tools purpose-built by a small number of developers 2.

Beyond the productivity costs, the inaccessibility and instability of many neuroimaging tools pose a wider threat to reproducibility 10-17, with reproducibility defined as "running the same software on the same input data and obtaining the same result" 16,18,19. The Transparency and Openness Promotion (TOP) guidelines, which have over 5,000 journals and organizations as signatories, state that all reported results should be independently reproduced before publication 20.
However, this is impractical and too time-consuming to implement at review 8. Where analysis pipelines are ported, subtle differences in the implementation of specific processing steps and software versions across computing environments can systematically affect results 21-24. Thus, it is often impossible to reproduce a prior study's results, even given the original data and analysis protocol 14,21. Controlling the specific software version of a tool and its dependencies is key to reproducibility 25. These considerations motivated us to create an accessible and portable data analysis environment that allows users to flexibly and reproducibly access a comprehensive range of neuroimaging tools.

Overview of the Neurodesk Platform
Here, we present Neurodesk, a platform facilitating Accessibility, Portability, Reproducibility, and Flexibility for Neuroimaging data analysis (Figure 1). In developing Neurodesk, we ensured that workflows developed on the Neurodesk platform remained consistent with these four guiding principles across updates to users' local computing environments. In this section, we introduce the available tools in the Neurodesk platform, discuss how these address the issues raised above, and report the results of an empirical evaluation of reproducibility in Neurodesk. For further details of the rationale behind the approaches adopted to achieve these results, please see the online methods.

At the core of Neurodesk are Neurocontainers, a collection of software containers that package a comprehensive and growing array of versioned neuroimaging tools (Figure 1b). The community contributes recipes based on the open-source project Neurodocker 44; a continuous integration system builds the containers and uploads them to a container registry (Figure 1a). Each Neurocontainer includes the packaged tool and all dependencies required to execute that tool, allowing it to run on various computing systems (Figure 1c). Because the containers isolate dependencies, different Neurocontainers can provide different versions of the same tool without conflicts. This mechanism allows researchers to seamlessly transition between software versions across projects or within a single analysis pipeline. A newly developed accessibility layer enables researchers to use software directly through the cloud or download containers for offline use, without the need to install software on a local system (Figure 1b).
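As a loose sketch of how this version isolation works in practice, the snippet below builds the command line one might use to execute a given tool version from its own container image. The image path and naming scheme are hypothetical placeholders for illustration, not Neurodesk's actual container layout.

```python
# Illustrative sketch only: the CVMFS-style path and naming scheme below are
# hypothetical placeholders, not Neurodesk's actual container layout.

def build_exec_command(tool, version, args):
    """Build a 'singularity exec' command for one versioned tool container."""
    image = f"/cvmfs/example-neurodesk/containers/{tool}_{version}.simg"
    return ["singularity", "exec", image, tool, *args]

# Two versions of the same tool can run side by side, because each version
# lives in its own container image with its own isolated dependencies.
cmd_old = build_exec_command("fsl", "6.0.4", ["--version"])
cmd_new = build_exec_command("fsl", "6.0.7", ["--version"])
print(cmd_old[2] != cmd_new[2])  # distinct images, so no version conflict
```

Because each invocation names its own image, switching tool versions within a pipeline reduces to changing one argument rather than reinstalling software.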
There are two options for interfacing with Neurocontainers. The first is Neurodesktop, a remote, browser-accessible virtual desktop environment that can launch any of the containerized tools from the application menu (Figure 1d). Analyzing neuroimaging data through Neurodesktop has the look and feel of working on one's local computer. For more advanced users and HPC environments, Neurocommand enables interfacing with Neurocontainers through the command line (Figure 1d). These interfaces can be deployed across almost any computing hardware and modern operating system, meaning that analysis pipelines developed using the Neurodesk platform are reproducible across environments ranging from local computers to cloud and HPC systems.

Long-Term Sustainability of the Neurodesk Platform
Neurodesk has a wide selection of tools available, spanning many domains of neuroimaging data analysis. Table 1 shows the tools available at the time of publication, though this list is growing rapidly. Users can find a full and up-to-date list at https://Neurodesk.org/applications/. Neurodesk employs a two-pronged approach to staying up-to-date with new neuroimaging tools and new versions of already included software: (a) the Neurodesk maintainers add tools as they become aware of new developments or as community members request the addition of new packages; the Neurodesk GitHub repository (https://github.com/NeuroDesk) has an active discussion forum where developers respond to requests for new software containers. (b) In addition to this developer-centric route to new software containers, we actively encourage contributions from the research community. A core aim for developing the Neurodesk platform was to build a community-driven project that is not contingent on a specific team of developers. As such, we provide a template and detailed instructions for creating build scripts for new software containers.

Reproducibility in Neurodesk
Scientific progress fundamentally depends on the peer review process: scientists must be able to critically assess reported findings and conclusions based on a clear and thorough methodological description 18. Well-documented experimental code is the most thorough description of any analysis pipeline. However, differences in computing environments and dependencies mean that access to this source code guarantees neither the capability to run the code nor the same result 19,105. Reproducibility has therefore come to represent a minimum standard by which to judge scientific claims 16,18,19. Unfortunately, scientific reproducibility is often not attainable due to differences in the outcomes of neuroimaging pipelines across different computing environments, as previously documented 21.

Each analysis was run twice within each environment to verify that there was no intra-environment variability. To evaluate the reproducibility of the analysis environment using locally installed vs. Neurodesk software, we compared the outputs for each installation type across computers (System A vs. System B). For intra- and inter-environment comparisons, we first compared file checksums. When two files produced different checksums, we quantified the pairwise differences across systems by computing Dice dissimilarity coefficients across images (Figure 2a). Note that there were never any intra-system differences in checksums (i.e., all analyses were deterministic, resulting in identical outcomes when run twice in the same computing environment). The code used to implement these analyses is available and re-executable through Neurodesk Play at: https://osf.io/e6pw3/.

Inter-system differences after the FSL FLIRT registration step were widespread yet subtle. In line with Glatard et al.'s approach, we next asked whether these differences impacted subcortical tissue segmentation (using FSL FIRST), the next step in the analysis pipeline.
File checksums for the segmentation outputs matched for 0% of images when run using the local installation and for 93% of images when run with Neurodesk. Computation of the Dice dissimilarity coefficients for each type of installation revealed that while differences were small, they had non-overlapping ranges. Indeed, differences were much less prevalent for the Neurodesk installations (Dice dissimilarity coefficient; range: 0.00 to 2.20 × 10^-5, M = 3.43 × 10^-7, SD < 0.01) compared with the local installations (Dice dissimilarity coefficient; range: 5.80 × 10^-5 to 4.59 × 10^-4, M = 1.46 × 10^-4, SD < 0.01; Figure 2c). On average, there were 426 times more voxel-wise disagreements across systems for the locally installed software than for Neurodesk. This difference can be visualized by comparing the 3D projections of the mean inter-system differences in classification across participants (Figure 3c, d). These projections illustrate that differences for locally installed software were widespread across all subcortical structures (Figure 3c), while any subtle differences for Neurodesk were limited to a few voxels (Figure 3d).

Figure 3e. Scatter plot showing the mean inter-system image intensity differences across all voxels within the classified subcortical structures vs. the number of voxels subsequently classified with different labels across systems. For analyses performed with locally installed software, participants with larger differences in image intensity typically also had more prolific disagreement in labels between systems (Pearson's r = 0.608, p < 0.001). This trend could not be assessed for Neurodesk, as there were no differences in image intensity across systems.
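The checksum-then-Dice comparison used above can be sketched in a few lines. The masks below are toy stand-ins for voxel-wise segmentations, not study outputs:

```python
import hashlib

def checksum(data):
    """Cheap first-pass identity test, as used for the file comparisons."""
    return hashlib.md5(data).hexdigest()

def dice_dissimilarity(a, b):
    """Dice dissimilarity between two binary masks (0.0 = identical)."""
    tt = sum(1 for x, y in zip(a, b) if x and y)      # foreground in both
    tf = sum(1 for x, y in zip(a, b) if x and not y)  # foreground only in a
    ft = sum(1 for x, y in zip(a, b) if not x and y)  # foreground only in b
    denom = 2 * tt + tf + ft
    return (tf + ft) / denom if denom else 0.0

# Toy segmentations from two hypothetical systems; they differ at one voxel.
mask_a = [1, 1, 1, 0, 0, 1, 0, 1]
mask_b = [1, 1, 0, 0, 0, 1, 0, 1]

if checksum(bytes(mask_a)) != checksum(bytes(mask_b)):
    print(round(dice_dissimilarity(mask_a, mask_b), 3))  # prints 0.111
```

Checksums give a fast binary identical/not-identical verdict; the Dice dissimilarity then grades how large any disagreement actually is.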

Understanding inter-system differences in image registration and tissue classification.
Differences in tissue classification were at least partially attributable to differences in registered image intensity earlier in the pipeline. Indeed, there was a strong positive correlation between the magnitude of each participant's inter-system differences in registered image intensity and inter-system classification mismatches (Pearson's r = 0.608, p < .001, Figure 3e). Thus, larger inter-system differences after the FSL FLIRT analysis were associated with larger inter-system differences after the subsequent FSL FIRST analysis.

Neurodesk can also aid research software developers wishing to make their tools more accessible. The effort to containerize and add one's software to Neurodesk may be minimal compared to the burden of testing across multiple computing platforms and fielding support queries from end-users running software in diverse environments.
Neurodesk currently has limitations that warrant discussion. The first is that the software containers in the Neurodesk platform do not yet support the ARM CPU architecture, which will become increasingly common as Mac users update their hardware. This stems from limitations in the underlying software applications, many of which do not yet support this processor architecture. However, tool developers are rapidly adapting tools for this architecture, and we are convinced that this problem will be addressed for the most used applications in the future. Further limitations may arise as Neurodesk is applied across more diverse use-cases by the broader research community. A pertinent example relates to the use of proprietary and licensed software. This is an area of active development as the Neurodesk community investigates how to integrate such software without compromising the accessibility principle. A strength of Neurodesk is that the community-oriented, continuous integration model provides a powerful and flexible way to address such expanded use-cases without depending on a single development team. This relates to a potential limitation of any such platform: the project's long-term sustainability. The Neurodesk platform was funded to be sustainable and supported by the community, but for this to be successful, the project needs constant maintenance. We therefore developed multiple pathways for sustainability, including federated support of the underlying hosting infrastructure, flexibility in the continuous integration and deployment infrastructure, and the potential for a commercial model to offer tailored support for institutions and workshops.

The challenges to accessibility and reproducibility posed by neuroimaging data analysis software are not unique to neuroscience.
While we have chosen to containerize software designed for neuroimaging datasets, the principles governing the design of the Neurodesk platform are not restricted to this field of research. This open-source platform could be used to deploy software specific to any other discipline, and it is our sincere hope that this platform is adapted to other disciplines struggling with similar issues. The Neurodesk platform has the potential to profoundly improve the way scientists analyze data and communicate results. For the first time, this platform allows any scientist, anywhere in the world, to conveniently access their data analysis tools and apply them in a fully reproducible manner from any computing environment. We are excited to see what new insights such technology can enable.

Online methods
Neurodesk's open-access code and documentation

All stages of development, from the initial conception as a hackathon project through to the most current iteration of Neurodesk, with up-to-date community-built Neurocontainer recipes, are documented publicly:

https://www.neurodesk.org/ - Platform website, which includes 'Getting Started' tutorials for new users of various skill levels.

https://github.com/NeuroDesk - Public GitHub repository, where Issues can be logged and contributions can be made by any community member with a GitHub account and the eagerness to create pull requests.

Data Availability
The data that support the findings of this study are available from the International Consortium for Brain Mapping (ICBM). Restrictions apply to the availability of these data, which were used under approved permission for the current study; they are not publicly available but can be obtained from ICBM upon request.

Code Availability
The code for this study is available in the GitHub repository at https://github.com/NeuroDesk with no restrictions on access. The code is licensed under the MIT License.

How can multiple tools be combined into a reproducible analysis pipeline in Neurodesk?
We provide a Jupyter Notebook to showcase how different tools can be used in a fully reproducible and shareable analysis pipeline: https://github.com/NeuroDesk/example-notebooks/blob/main/nipype_module_example.ipynb. In this example, we demonstrate the use of FSL and AFNI on a publicly available dataset. We used the open-source nipype workflow system to execute analyses on these data, enabling complex analyses to be built, shared, and executed identically in another Neurodesk installation.

Will running my analyses on Neurodesk be slower than if they were run locally, especially if I'm on a slower internet connection?
Internet bandwidth will only affect your analysis speed the first time you use a new tool. Neurodesk uses the CernVM File System (CVMFS), meaning that only the specific parts of a currently used container will be downloaded over the internet. Once downloaded, these will be cached locally, meaning that the software will operate at the same speed as it would when running locally (see Table S1). Although there is a container initialization time that could impact performance in comparison to a non-containerized workflow, there is evidence that in some cases containerized analysis pipelines may run even faster than locally installed software due to efficiency gains in accessing files 113.

Where are Neurodesk containers stored, and will the performance differ from country to country?

Neurodesk containers are distributed globally via CVMFS and accessed from the fastest server according to your location. We aim to place mirror servers as close as possible to all users so that CVMFS can automatically use the fastest available mirror server.

Are there any security concerns regarding using the Neurodesk platform in a web browser? For example, could there be any risks that compromise data processed on Neurodesk?
The underlying container technology in Neurodesk ensures that applications are isolated with the least privileges to minimize the impact of malicious software. Interacting with the web from within Neurodesktop poses a similar risk to any system with access to the internet, so all the usual precautions apply. Neurodesktop can be shut down, deleted, and started fresh with minimal effort, which means recovery is significantly simpler than for a native installation in a similar scenario. To ensure data security, it is essential for users who run Neurodesk on a cloud provider or in their local network to follow security best practices and secure the port Neurodesktop is running on via firewall rules. For an in-depth review of the potential security concerns of containerizing scientific data analysis software, see Kaur et al.

Why does Neurodesk use both Docker and Singularity containers?

Docker is used for Neurodesktop due to its cross-platform support and its ability to run Singularity containers within it. Singularity, which is used for the individual application containers (Neurocontainers), is preferred by most high-performance computing (HPC) platforms, where multi-user security and scheduling are of particular concern, and can also be used indirectly via wrapper scripts and Lmod, a system which manages environment configurations for different software packages.

Are there any financial costs associated with keeping Neurodesk running, and if so, how will these be met for the foreseeable future?

The long-term sustainability of Neurodesk has been planned along three pathways: federated support of the underlying hosting infrastructure, flexibility in the continuous integration and deployment infrastructure, and the potential for a commercial model offering tailored support for institutions and workshops.

Neurodesk is open-source, such that anyone is able to contribute containerized software to the platform. Are there any protocols in place to verify that this software is working as expected before it is made available to the community?
There is a feature to include a functional test within each tool's container. This test can be run automatically after each container is built. However, such automated tests can only cover a subset of potential problems, so we also rely on issues reported by users on GitHub and on manual testing of new containers when releasing new versions.

How can I request a new tool to be added?
Users can submit a GitHub issue to request new tools by providing the following information: name and version of the tool, preferred Linux distribution, Linux commands for installing the tool, any GUI applications and the commands to start them, test data to include, a reference to the tool's paper, a link to its documentation, and commands for testing the tool.

How do I get help if I encounter an issue with Neurodesk?
There is an active discussion forum on GitHub with a Q&A section. If your question has not already been addressed there, please raise a new issue.

Reproducibility in Neurodesk
To investigate our claims that the Neurodesk platform's containerized tools lead to more reproducible results than locally installed software, we sought to conceptually replicate the results reported by Glatard et al. (2015) using Neurodesk vs. locally installed software across different operating systems. The first steps in Glatard et al.'s analysis pipeline were brain extraction and tissue classification.

Brain extraction and tissue classification. FSL BET and FAST were run on raw MRI images to extract voxels containing brain tissue and classify tissue types, respectively. The file checksums for the outputs of these processing steps were identical across all computing environments, verifying that the implementation of the processing pipeline was reproducible across systems for both Neurodesk and local installations. After these steps, image registration and tissue classification were performed with FSL FLIRT and FSL FIRST, respectively. These analysis steps did lead to differences in results across systems, and are thus reported in the main text.
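The per-step checksum verification described above can be sketched as follows; the byte strings are placeholders standing in for real output files, and the step names simply mirror the pipeline discussed in the text.

```python
import hashlib

# Hypothetical per-step outputs from two systems: 'bet' agrees across systems,
# mimicking the reproducible steps, while 'flirt' does not. Bytes are toy data.
outputs = {
    "System A": {"bet": b"brain-mask", "flirt": b"registered-a"},
    "System B": {"bet": b"brain-mask", "flirt": b"registered-b"},
}

def reproducible_steps(results):
    """A step is reproducible if every system produced the same checksum."""
    steps = next(iter(results.values()))
    report = {}
    for step in steps:
        digests = {hashlib.sha256(env[step]).hexdigest()
                   for env in results.values()}
        report[step] = len(digests) == 1   # one unique digest = all agree
    return report

print(reproducible_steps(outputs))  # 'bet' matches, 'flirt' does not
```

Steps whose checksums disagree are the ones that warrant the finer-grained Dice-coefficient comparison reported in the main text.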

Understanding inter-system differences in image registration and tissue classification.
Given that the image registration and tissue classification steps led to inter-system differences, we sought to understand the cause of these differences. FSL utilizes dynamic linking to shared system libraries, such as libmath and LAPACK, which are loaded at runtime. Thus, while the same version of FSL was installed in all four computing environments, differences in image processing still emerged for analyses run on locally installed software. This is due to differences in dependencies across systems, a problem addressed by Neurodesk. To better understand how such differences might emerge, calls to these libraries were recorded for a representative image using ltrace. The libraries called during the FLIRT and FIRST analyses could be categorized into four main classes: mathematical operations, matrix operations, memory allocation, and system operations. Interestingly, Glatard et al., who used older software versions than we investigated here, found that image processing differences across systems resulted largely from differences in floating point representation in the mathematical functions expf(), cosf(), and sinf(). They also found inter-system differences in the control flow of the programs, indicated by differences in the number of library calls to mathematical functions such as floorf(). Here, differences in floating point representation were less severe, as these were only present for the sinf() function. However, the number of calls made to several functions differed across the local FSL installations, indicating that inter-system differences in the control flow of the processing pipeline remain an issue for reproducibility (Table S1). The floorf() function represented the most prevalent difference in library calls.
There were over 13,000 additional calls to this function made on System B relative to System A for the FLIRT analysis, and approximately 5.5 million additional calls for the FIRST analysis. Overall, the FIRST analysis had the greater discrepancy in calls. After accounting for the additional calls to floorf(), which occurred early in the FIRST analysis pipeline, mismatches in the sequence of system calls to several other functions remained (Figure 4a). However, all remaining mismatches across systems occurred in memory allocation functions. Importantly, there was no difference in floating point representation or in the number of system calls to shared libraries across systems for the Neurodesk implementation of FSL (Figure 4b), while maintaining a similar runtime to the local installation on the same hardware (Table S1).
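Counting library calls from ltrace-style output, as done for the comparison above, can be sketched like this. The trace lines are toy examples, far simpler than real ltrace output; only the function names echo those discussed in the text.

```python
from collections import Counter

# Toy ltrace-style lines from two systems; the traces themselves are
# fabricated, but the function names mirror those in the text.
trace_a = ["floorf(1.2)", "sinf(0.5)", "malloc(64)", "floorf(3.7)"]
trace_b = ["floorf(1.2)", "sinf(0.5)", "malloc(64)"]

def call_counts(lines):
    """Count calls per library function from ltrace-style lines."""
    return Counter(line.split("(", 1)[0] for line in lines)

# Surplus calls on system A relative to system B, per function: a control-flow
# difference analogous to the extra floorf() calls reported in the text.
surplus = call_counts(trace_a) - call_counts(trace_b)
print(dict(surplus))  # {'floorf': 1}
```

An empty surplus in both directions is what the Neurodesk traces showed: identical call counts, and hence identical control flow, across systems.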

Understanding the practical implications of inter-system differences. The local installations led to inter-system differences in tissue classification orders of magnitude larger than in Neurodesk. However, it is difficult to know how voxel-wise differences of this scale might actually affect test statistics; i.e., could there actually be a different conclusion about the research question if the same analysis on the same data were run on a different computer? To address these questions, we performed a permutation test to examine the impact of inter-system differences in tissue classification (using FSL FIRST) on correlations between subcortical structure volumes and age.

On each system (A, B), for both Neurodesk and local installations, we computed the volume of each subcortical structure in the left hemisphere, the right hemisphere, and the whole structure by participant. We performed permutation tests for each of these volumes (9999 permutations each). On each permutation, we performed a Pearson correlation of volume vs. participant age and calculated the difference in the values of the correlation coefficients across the two systems. These permutation tests were repeated for three different sample sizes (n = 10, 30, 50), such that each permutation for each sample size represented a different randomly selected group of participants. Critically, for each sample-wise permutation, the same sample was used for each of the two systems, such that the test-statistic difference always represented inter-system differences rather than inter-sample differences. Thus, the distribution of test statistic differences for each sample size represents 209,979 permuted samples (7 subcortical structures (putamen, amygdala, thalamus, pallidum, caudate nucleus, hippocampus, accumbens) × 3 volume measures (left hemisphere, right hemisphere, both) × 9999 subject-wise permutations).
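A minimal, self-contained sketch of this subject-wise permutation scheme is below. All numbers are synthetic: the ages, volumes, and the size of the simulated inter-system difference are illustrative stand-ins, not study data.

```python
import random

# Synthetic cohort: per-participant age and a structure volume as "measured"
# on two systems, where system B carries a small simulated processing noise.
random.seed(0)
ages = [random.uniform(20, 80) for _ in range(50)]
vol_a = [1000 + 5 * a + random.gauss(0, 40) for a in ages]   # system A
vol_b = [v + random.gauss(0, 0.5) for v in vol_a]            # system B

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

def perm_r_differences(n, n_perm=999):
    """r(age, volume) on system A minus system B, over n_perm subsamples.

    The same subsample is used on both systems, so each difference reflects
    inter-system variability only, never inter-sample variability."""
    diffs = []
    for _ in range(n_perm):
        idx = random.sample(range(len(ages)), n)
        sub = [ages[i] for i in idx]
        diffs.append(pearson(sub, [vol_a[i] for i in idx])
                     - pearson(sub, [vol_b[i] for i in idx]))
    return diffs

diffs = perm_r_differences(10)
print(min(diffs), max(diffs))  # range of inter-system r differences at n=10
```

Repeating the call for n = 10, 30, 50 yields one distribution of r differences per sample size, which is the quantity summarized in Figure S1.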
The analysis showed that as sample size decreased, the inter-system coefficient differences for the local installations increased in magnitude (local installation: N = 50, Δr = −0.02 to 0.02; N = 30, Δr = −0.04 to 0.03; N = 10, Δr = −0.08 to 0.11; Figure S1). By contrast, the inter-system test statistic differences for Neurodesk were negligible and did not scale with sample size (Neurodesk: N = 50, Δr = −1.74 × 10^-3 to 2.59 × 10^-4; N = 30, Δr = −3.75 × 10^-5 to 1.89 × 10^-4; N = 10, Δr = −1.52 × 10^-3 to 0; Figure S1). Thus, the minor differences in image processing with locally installed software can meaningfully impact the reliability of test statistics, especially when statistical power is already low. It is therefore crucial to consider both sample variability and system variability when conducting these types of analyses.

Figure S1. Permutation test results showing inter-system differences in r-values for the correlation between age and volume of subcortical structures, organized by sample size (n = 10, 30, 50).