Seamless Data Analysis, Visualizations and Sharing: Perspective from IRRI’s Rainfed Rice Breeding Program

Phenotypic data analysis is a key component of crop breeding: it extracts the insights from data that support better breeding decisions. Each year the rainfed rice breeding (RRB) program at IRRI conducts trials at national agricultural research and extension systems (NARES) network-partner sites across South Asia, Southeast Asia and Africa. Analyzing the data from the network trials and sharing the results with the partners in the best possible format is a daunting task. It is crucial to demystify data analysis for the NARES partners so that they can make better breeding decisions. Here, we provide an overview of how the RRB program at IRRI has leveraged the computational power of R together with open-source tools such as R Markdown, plotly, LaTeX and HTML to develop a data analysis workflow and turn it into a reproducible document for better interpretation, visualization and seamless sharing with partners. The generated report implements the analysis workflow and presents its outputs, whether text, tables or graphics, in a unified way as one document. The analysis is highly reproducible and can be regenerated at any time. The plots are dynamic and interactive, which aids understanding and makes information easy to extract. Tables are interactive and manageable, and can be exported from within the document in numerous formats. The source code and a demo data set are available for download and use at https://github.com/whussain2/Analysis-pipeline. In conclusion, the analysis workflow and document presented here are not limited to IRRI's RRB program but are applicable to any organization or institute with a full-fledged breeding program.


Background
The International Rice Research Institute (IRRI), established in the 1960s, is the world's premier research organization dedicated to rice science. The RRB program at IRRI dates from the establishment of the institute and remains committed to developing improved rice germplasm that improves the livelihoods of farmers facing challenging climates (Dar et al. 2020). The current rice breeding project at IRRI, "Accelerated Genetic Gains in Rice Alliance", funded by the Bill and Melinda Gates Foundation (BMGF), is mandated to modernize breeding strategies and frameworks to increase the current rates of genetic gain, in close collaboration with NARES network partners across South Asia (India, Bangladesh, and Nepal) and East and Southern Africa (Kenya, Mozambique, Tanzania and Burundi).
Every year the rainfed breeding program at IRRI shares breeding germplasm tolerant to drought, salt, heat and submergence with the regional partners for phenotypic evaluation and in return receives raw data from several trials at different locations. For instance, in 2019 the rainfed breeding program received data from approximately 20 trials from the NARES partners. It is crucial to demystify data analysis for regional partners so that they can make better breeding decisions, and to present the results in an easily understandable format. Explicit documentation contributes to clear interpretation and understanding of results and promotes collaboration. However, analyzing and documenting results simultaneously has not been possible with readily available computational tools, which rely on a 'copy and paste' approach to reporting that is highly error-prone. Thus, we believe an immediate upgrade of the data analysis workflow is crucial for greater effectiveness and reproducibility (Beaulieu-Jones and Greene 2017). Such an improvement is also necessary for conveniently documenting and sharing the reports.
Technological advances have made data management, analysis, interpretation, visualization and sharing more convenient. For example, R (R Core Team 2018) packages such as ggplot2 (Wickham 2016), plotly (https://plotly.com/) and DT (https://rstudio.github.io/DT/) have made data mining manageable and visualizations interactive and dynamic. Similarly, with R Markdown (Baumer and Udwin 2015), data analysis can be turned into high-quality reproducible reports in which code, text, tables, graphics and more are embedded in one unified document. Furthermore, the reports can be generated in a variety of formats, including MS Word, PDF and HTML (HyperText Markup Language), for seamless sharing (https://rmarkdown.rstudio.com/).
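As a minimal sketch of this idea, and not the program's actual report template, the R code below writes a small R Markdown file that mixes text with a code chunk and then renders it to HTML. The title, file name and chunk contents are hypothetical, and the header options shown (a floating table of contents and foldable code) are illustrative choices. `rmarkdown::render()` is the standard rendering function; it requires the rmarkdown package and pandoc, so the rendering step is skipped when either is missing.

```r
# Minimal sketch: build a tiny R Markdown file and render it to HTML.
# Title, file name, and chunk contents are hypothetical.
fence <- strrep("`", 3)  # three backticks, the chunk delimiter

rmd_lines <- c(
  "---",
  "title: \"Trial report (demo)\"",
  "output:",
  "  html_document:",
  "    toc: true",
  "    toc_float: true",
  "    code_folding: hide",
  "---",
  "",
  "## Summary statistics",
  "",
  "Text, code and output live in one document.",
  "",
  paste0(fence, "{r}"),
  "summary(c(4.1, 3.8, 5.2, 4.7))",
  fence
)

rmd_file <- file.path(tempdir(), "demo_report.Rmd")
writeLines(rmd_lines, rmd_file)

# Render only if rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  message("Report written to: ", html_file)
}
```

Re-running the same script after the data or text changes regenerates the whole report, which is what removes the manual copy-and-paste step.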
Here, we provide an overview of how the RRB program at IRRI has leveraged the computational power of R together with the open-source tools R Markdown, plotly, LaTeX (Triantafyllidis and Papageorgiou 2018) (https://www.latex-project.org/get/) and HTML to develop a workflow for phenotypic data analysis and turn it into a reproducible document for better interpretation, visualization and easy sharing with collaborators. The analytical pipeline we report is distinctive because the data analysis workflow, the description of the scripts, the outputs (text, tables or graphics) and the interpretation of results are compiled into a single document; in simpler words, 'everything is in one place'.

Main Features Of Analysis Report
The phenotypic data analysis workflow we developed combines the analysis workflow, a description of the scripts, and the results with detailed descriptions in one unified document. A sample document is given in Additional file 1. The main features of the document are:
1. The report shows a complete data analysis workflow and a modern way of analyzing the data (Fig. 1). Data modelling includes advanced mixed models that account for both experimental design and spatial variation (Isik et al. 2017). Five mixed models are tested, and the best model is selected for downstream analysis and generation of results. The models are described in detail in Additional file 1 under sections 4.1 and 4.2.
2. The analysis pipeline and document are highly reproducible: the same report can be regenerated whenever needed. The sample source code of the analytical pipeline and the demo data set can be downloaded directly from the GitHub repository (https://github.com/whussain2/Analysis-pipeline). Instructions on how to run the analysis pipeline on a local computer are given on the GitHub repository page.
3. New data and/or edits and corrections to the existing pipeline can be incorporated by simply re-knitting the R Markdown '.Rmd' document (https://rmarkdown.rstudio.com/articles_intro.html). This avoids manually updating or generating reports or PowerPoint slides, which is time-consuming and highly prone to errors.
4. The document includes metadata (information about the field trial, data collection, experimental design, and more) at the beginning for quick identification, location and association of data and analysis at any given time (Fig. 2a).
5. The document is well structured and organized. It is divided into sections with headings and subheadings to make it easier to access and understand. The table of contents is always visible, making navigation within the document faster and easier (Fig. 2a). Additionally, readers have the flexibility to hide sections for better readability.
6. The document is currently generated in HTML, which once downloaded can be opened in any browser without access to the internet. Further, HTML files can be shared easily and/or hosted on websites for future use.
7. The graphics in the document are highly dynamic and interactive. Simply hovering the cursor over a plot displays additional, otherwise hidden information, which is not possible in static plots. For example, the box plots and the heatmaps of the field experimental design used to visualize spatial trends are dynamic and interactive (Fig. 2b and 2c). Additionally, plots can be easily exported to the local drive.
8. The output generated in the form of tables is highly dynamic and interactive. Tables can be easily managed, searched and sorted like a small Excel sheet (Fig. 2d). Tables can also be exported in various formats or printed directly from within the document. Because the tables and other results are in the same file, there is no need to save separate files and dig through them to extract useful information when preparing presentations or making breeding decisions.
9. The scripts, procedures and methods used for analysis are fully described in the same document. Results presented as plots and/or tables are thoroughly described to aid interpretation and understanding. Hyperlinks are embedded in the relevant sections to help users understand the concepts and build their knowledge. For example, web sources on how to interpret box plots, methods used to calculate heritability with complex models, spatial analysis modelling and much more are hyperlinked in the document (Fig. 2e and 2f).
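Features 7 and 8 can be illustrated with a short sketch. The R code below simulates a hypothetical 4 x 5 field layout (all data, column names and plot settings are assumptions made for illustration) and builds an interactive field heatmap and box plot with plotly, plus a searchable, exportable table with DT and its Buttons extension. It is one plausible way to obtain such outputs, not necessarily the code used in the sample report; the widget steps are skipped if the packages are not installed.

```r
# Hypothetical plot-level data for a field with 4 rows and 5 columns.
set.seed(1)
field <- expand.grid(row = 1:4, col = 1:5)
field$genotype <- paste0("G", seq_len(nrow(field)))
field$yield <- round(rnorm(nrow(field), mean = 4.5, sd = 0.6), 2)

# expand.grid varies 'row' fastest, so filling a 4 x 5 matrix column-wise
# places each yield at its [row, col] field position.
yield_mat <- matrix(field$yield, nrow = 4, ncol = 5)

if (requireNamespace("plotly", quietly = TRUE)) {
  # Field heatmap: hovering over a cell shows its column, row and yield.
  heat <- plotly::plot_ly(x = 1:5, y = 1:4, z = yield_mat, type = "heatmap")
  # Box plot of yield by field row: hovering shows quartiles and outliers.
  box <- plotly::plot_ly(field, x = ~factor(row), y = ~yield, type = "box")
}

if (requireNamespace("DT", quietly = TRUE)) {
  # Sortable, searchable table with copy/CSV/Excel/print buttons.
  tab <- DT::datatable(
    field,
    rownames = FALSE,
    extensions = "Buttons",
    options = list(dom = "Bfrtip",
                   buttons = c("copy", "csv", "excel", "print"))
  )
}
```

Printing `heat`, `box` or `tab` inside an R Markdown chunk embeds the corresponding widget in the HTML report.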

Conclusion
We report an implementation of a data analysis workflow and methodology, documented for better understanding and easy sharing with partners. The interactive plots and tables make it easier for users to extract the information they need, with everything available at their fingertips. Fully harnessing the benefits of this approach requires basic knowledge of R programming and R Markdown. We believe this is a valuable initiative to modernize the data analysis of IRRI's RRB program, and it can be further improved in the future. In conclusion, the analysis pipeline should be of great use to crop breeding communities with full-fledged breeding programs.

Availability of supporting data
The R source code and demo data are available in the GitHub repository https://github.com/whussain2/Analysis-pipeline. Detailed instructions on how to use the source code and run the analysis pipeline on a local computer are given on the GitHub repository page.

Figure 1
Schematic representation of the data analysis workflow adopted in the current R Markdown generated report. The four main steps in the workflow are: a) data importing, b) data pre-processing, c) data modelling, and d) generation of results. a) Data are imported and general information (metadata) on the data is recorded at the beginning of the document. b) In pre-processing, data are checked for missing values. Descriptive statistics (mean, mode, coefficient of variation, standard deviation and more) are generated to get a general idea of the data. Interactive heatmaps of the field experimental design are plotted to check for spatial trends in the data. Data are visualized using histograms, box plots and QQ plots to check the data distribution, normality and correlated errors among data points. Lastly, in the pre-processing step, outliers are identified and filtered using a univariate approach. c) In data modelling, a mixed-model approach is used to correct for experimental design factors and spatial variation. A detailed description of the types of mixed models used for analysis is given in the sample report. Trials are modelled separately and also through a combined approach if more than one trial is available. d) After modelling, various results are extracted, including best linear unbiased predictors (BLUPs) or best linear unbiased estimates (BLUEs), depending on whether genotypes were treated as random or fixed. Variances and heritability for the variable are also estimated. The genotypes are ranked on the basis of BLUPs or BLUEs to make breeding decisions. Finally, general remarks on the overall data analysis are noted at the end of the document.
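The pre-processing and modelling steps in Figure 1 can be sketched in R as follows. This is a simplified illustration under stated assumptions, not the pipeline's own code: the trial data are simulated, the univariate outlier filter is assumed to be the box plot rule (values beyond 1.5 times the interquartile range from the quartiles), the model is a single randomized complete block model fitted with lme4 rather than the five design and spatial models of the sample report, and heritability is computed with the simple entry-mean formula H2 = Vg / (Vg + Ve / r), which is only approximate once plots have been removed.

```r
# Simulated trial: 20 genotypes in 3 replicates (all values hypothetical).
set.seed(42)
n_gen <- 20
n_rep <- 3
trial <- expand.grid(genotype = paste0("G", seq_len(n_gen)),
                     rep = paste0("R", seq_len(n_rep)))
g_eff <- rnorm(n_gen, sd = 0.5)  # genotype effects
r_eff <- rnorm(n_rep, sd = 0.2)  # replicate effects
trial$yield <- 4.5 + g_eff[as.integer(trial$genotype)] +
  r_eff[as.integer(trial$rep)] + rnorm(nrow(trial), sd = 0.4)
trial$yield[5] <- 12             # planted outlier for the demo

# Univariate outlier flag (assumed rule: beyond k * IQR from the quartiles).
flag_outliers <- function(x, k = 1.5) {
  q <- unname(stats::quantile(x, c(0.25, 0.75), na.rm = TRUE))
  margin <- k * (q[2] - q[1])
  x < q[1] - margin | x > q[2] + margin
}

is_out <- flag_outliers(trial$yield)
# Keep plots that are neither missing nor flagged.
clean <- trial[!is.na(trial$yield) & !is_out, ]

if (requireNamespace("lme4", quietly = TRUE)) {
  # Genotypes random (BLUPs), replicates fixed.
  fit <- lme4::lmer(yield ~ rep + (1 | genotype), data = clean)

  # Variance components and entry-mean heritability.
  vc <- as.data.frame(lme4::VarCorr(fit))
  v_g <- vc$vcov[vc$grp == "genotype"]
  v_e <- vc$vcov[vc$grp == "Residual"]
  h2 <- v_g / (v_g + v_e / n_rep)

  # Genotype BLUPs, ranked from highest to lowest.
  blup <- lme4::ranef(fit)$genotype
  ranking <- blup[order(-blup[["(Intercept)"]]), , drop = FALSE]
}
```

Treating genotypes as fixed instead (for example `lm(yield ~ rep + genotype, data = clean)`) would yield BLUEs rather than BLUPs, matching the distinction drawn in step d).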


Figure 2
Panels showing screenshots of some of the features of the generated document. 2(a) shows the general information and the table of contents used to navigate the document. 2(b) shows an example of an interactive heatmap for visualizing the field experimental design and checking for spatial variation in the field. 2(c) shows an interactive box plot. 2(d) shows an example of the interactive tables, which can be managed and exported in various formats. 2(e) shows an example of the descriptive content that aids understanding of the data analysis, code and procedure. 2(f) shows the detailed description of one of the spatial models used for the analysis of the data.
