Mining the Conserved Transcriptional Response to Phytophthora Infection for Factors Beyond the Known PR Genes

Oomycetes of the genus Phytophthora are devastating plant pathogens that affect many commercially important plants. Considerable efforts have been made to investigate the transcriptional response of individual plant species to phytophthora infection, often showing a concerted upregulation of pathogen-response (PR) gene families, which are also induced upon infection by fungi and other biotic and even non-biotic stressors. By integrating four transcriptomics datasets derived from three different plants (arabidopsis, soybean, cocoa), a core set of upregulated sequence clusters was derived, which represents a conserved multi-species response to phytophthora infections. We annotated more than 300 common induced gene clusters and subjected them to bioinformatical analysis. Besides the expected PR genes, several novel gene families without known links to biotic stress were found to be strongly induced in all tested datasets. Among the most prominent response genes are two families of putatively secreted peptides and a family of predicted mitochondrial complex-IV associated proteins. Interestingly, the latter sequences are related to the mammalian NDUFA4 family, which also contains members with constitutive and pathogen-induced expression. This recurrent functional diversification points toward an important role of complex IV regulation within the biotic defense response in multiple kingdoms.

Phytophthora with varying host ranges; some of these species like P. sojae and P. megakarya are restricted to a single known host species, while others like P. capsici, P. ramorum, P. parasitica

surface appressoria and is followed by establishment of an apoplastic hyphal network. At the biotrophic stage, Phytophthora produce haustoria in plant cells. These structures participate in the acquisition of nutrients and release virulence proteins known as effectors (Petre and Kamoun, 2014; Dong and Ma, 2021). This is followed by the necrotrophic stage, characterized by tissue necrosis of the host and the production of numerous sporangia that release zoospores (Judelson

During this infection stage, the host plant can sense the pathogen, either directly or by virtue of its activity, and then mount a more or less targeted defense response. As the first line of defense, pathogen-associated molecular patterns (PAMPs) are recognized by plant surface receptors, which then initiate a defense response called PAMP-triggered immunity (PTI) (

identification of resistance traits (Meng et al., 2021). The aim of the current study was to mine plant transcriptomics data for novel and unconventional regulatory events that broaden our understanding of the host response to phytophthora infections. Virtually every dataset measuring the transcriptional response to pathogens contains a number of upregulated genes that do not belong to any of the known PR classes and do not have a (predicted) function connected to the defense response, if they have any annotated function at all. Some of these observations might be due to artefacts (Petri et al., 2012), but others might represent unexpected but genuine regulations.
In order to reliably identify novel and biologically relevant regulations over a background of transcriptional noise, we have applied a bioinformatical pipeline that searches for conserved regulation events in small gene families from different species, measured by different transcriptomics platforms. The gene families targeted by this approach may either be groups of orthologs, or include closely related paralogs as they are often observed in lineage-specific gene duplications (Sonnhammer and Koonin, 2002). Requiring a conserved regulation mode should deplete spurious false-positive observations and emphasize gene families whose pathogen-responsive induction is biologically meaningful and has been selected for by evolution.

As the basis of our analysis, we selected four high-quality transcriptomics datasets generated from three plant species (arabidopsis, soybean and cocoa) and four different phytophthora spe-

Not only do the selected datasets differ in the host species, they also use a diverse set of infection conditions, time points post inoculation, and measurement modalities. To reach a common frame of reference for inter-species comparisons, a unified 'combined logarithmic induction factor' (CLI factor) was established for each host species. The idea of this combined factor was to select all genes that are either strongly induced in one of the experimental conditions, or moderately induced in multiple conditions. The details of the CLI calculation are outlined below. In brief, the log-induction values of replicate experiments were averaged, and the remaining N conditions were considered orthogonal coordinates in an N-dimensional 'regulation space'. The species-specific CLI factor is the length of each gene's regulation vector. In general, genes surpassing a CLI factor of 4.0 were considered as strongly induced and were included in the multi-species clustering analysis.
This condition was met by 1766 Arabidopsis thaliana genes, 1019 genes from Glycine max, and 1451 genes from Theobroma cacao.

Definition of a conserved induction signature
The next step of the analysis was to cluster the strongly Phytophthora-induced genes from three plant species by their sequence relationship. One aim of this clustering was to establish connections between orthologous regulated genes in different species; a second aim was to define families of commonly upregulated genes including those with multiple members in one species. Upon visual inspection of the induced gene lists, it became clear that many of them belong to huge sequence families (e.g. kinases, oxidases, leucine-rich repeat proteins etc.) that also contain many non-regulated members. As a consequence, a finer granularity in sequence clustering is required to identify regulated groups of bona fide orthologs, possibly including their closely related inparalogs (Sonnhammer and Koonin, 2002).

To address this issue, protein sequences encoded by the strongly induced genes were compared to each other using BLAST (Altschul et al., 1990). In addition, all sequences were compared to a database of outgroup sequences, representing various taxa outside the flowering plants. Significant BLAST hits between pairs of induced proteins were accepted only if these hits were better (lower composition-adjusted E-value) than any match between one of the candidates and the outgroup database. In total, 13612 connections fulfilled these criteria and were rendered into a network diagram using the Cytoscape software (Shannon et al., 2003). As can be seen in the overview in Figure 1, the outgroup filtering prevented the generation of huge non-informative superfamilies and segmented the network into manageable clusters, most of them with fewer than twenty member proteins from multiple species. The full network is provided as Supplementary file S1.
Since plant genome databases contain large multi-domain proteins and occasionally also fusions between otherwise unrelated genes, it is not guaranteed that all proteins connected by uninter-

analysis (Table 1, Supplementary file S3). While the more homogeneous PR-families are represented by a single sequence cluster, more diverse families such as β-1,3-glucanases (PR-2), chitinases (PR-3) or peroxidases (PR-9) are subdivided into multiple USCs. The observed segmentation of protein families into multiple USCs usually indicates the presence of upregulated subfamilies within a larger superfamily, which may also contain non-induced genes. One striking example is the peroxidase superfamily, which is very heterogeneous and contains enzymes localized in dif-

Strong induction of secreted peptide precursors
Two of the strongest upregulated clusters comprise short proteins with a predicted extracellular localization, which might act as precursors for signaling or defense peptides. Cluster #126, which consists of two Arabidopsis genes (AT2G23270 and AT4G37290), one Theobroma gene (Tc01_p005760) and two Glycine genes (glyma05g21360 and glyma17g18230), showed a particularly strong and consistent 50- to 300-fold upregulation in the different infection models (Figure 2). There were no major differences between the different time points in the Arabidopsis model

therefore not a member of the upregulated PREPIP2/3 sequence cluster. Instead, it forms a separate USC #266 together with two Theobroma genes (Tc04_p018060 and Tc04_p018070), which are also transcriptionally induced, albeit not as strongly as PREPIP2 and 3 (Figure 2a).

Our bioinformatical analysis shows that members of USC #126, including Arabidopsis PREPIP2/3, are members of a protein family found in all major dicotyledon lineages, which is characterized by a conserved C-terminal ~22 aa repeat that is present in two, four or ten tandem copies in the

which is absent from PREPIP1 and PCEP, and the strong response to phytophthora infection suggest that these proteins are particularly relevant for signaling an infection by oomycetes.

A new family of mitochondrial anti-ROS protection factors
An unexpected, but particularly interesting finding was the strong upregulation of sequence cluster #87, comprising a number of short mitochondrial proteins. A detailed bioinformatical analysis of this family revealed that these proteins are distant relatives of the NDUFA4 subunit of the mitochondrial complex IV, also known as the cytochrome c oxidase complex (Figure 3a). Plant members of the NDUFA4 family had not been described so far. All land plants encode multiple members of the NDUFA4-like family, but in the three plant models studied here, only selected family members are strongly upregulated by phytophthora and are thus included in USC #87 (Figure 3b,c). The sequences of the plant NDUFA4 homologs have substantially diverged from their animal counterparts (Figure 3a), and the dendrogram analysis suggests that the diversity of the plant members (2 genes in Arabidopsis, 3 genes in Theobroma, and 7 genes in Glycine) arose after the split between the plant and animal kingdoms (Figure 3a,c).

Four of the plant genes (AT3G29970, Tc00_p011650, glyma08g27620 and glyma18g50800) are particularly responsive to Phytophthora infection and form a clade in the dendrogram (Figure 3c).

A second, less strongly induced clade is formed by the USC #87 members Tc02_p026410 and glyma18g01340, together with glyma11g37370. The latter gene is not annotated as part of cluster #87, since its upregulation factor did not quite meet the CLI > 4.0 criterion. A third clade is formed by the constitutively expressed family members AT3G48140, Tc03_p000330, glyma04g34840 and glyma06g19850, which are not induced by Phytophthora and thus not members of any USC (Figure 3c). Mammalian genomes also contain multiple members of the NDUFA4 family, which result from independent gene duplication events (Figure 3c). Interestingly, human NDUFA4 is constitutively expressed, while its two paralogs NDUFA4L2 and C15ORF48 (also known as

and appears to protect mitochondria from excessive ROS production during the defense response. Besides the NDUFA4 family, there are also several other upregulated clusters populated with mitochondrial proteins. Among them is cluster #176, whose member proteins appear to be ho-

ing to a particular biological response pathway is the presence of spurious signals, in particular when searching for 'novel' or 'unexpected' regulation events. The current study addresses this problem by comparing multiple datasets obtained from different species in similar infection situations. By focusing on gene families with consistent up-regulation, spurious data are efficiently depleted since they are unlikely to be found in multiple species, measured by different transcriptomics platforms. A downside inherent to this approach is the likely loss of species-specific regulations, or those that only occur in particular experimental setups. Nevertheless, the 'con-

PROPIP2 and PROPIP3 contain a C-terminal repeat, suggesting that multiple PIP2- and PIP3-like peptides can be released from these precursors.
In particular, the cocoa protein Tc01_p005760 is predicted to generate no fewer than ten such peptides, all of them related but with distinct differences, making it likely that they can bind to multiple receptor types (Zhang et al.,

The comprehensive cluster table provided as Supplementary file S2 contains many additional clusters of phytophthora-responsive genes in multiple species, whose exact biological function remains to be uncovered. The bioinformatical annotations added to the clusters provide a first step in this direction.

dataset is based on the NimbleGen T. cacao 28k array and comprises data measured 72 h post inoculation in five-fold replicates. For the current analysis, an averaged log2 upregulation factor (relative to the water-inoculated samples) was calculated. A second, independent dataset measuring the response of susceptible cocoa pods to infection with Phytophthora palmivora and Phytophthora megakarya was downloaded from the data supplement of Ali et al. (2017). This dataset is based on RNA-Seq data using the Illumina HiSeq2000 platform and provides normalized RPKM (reads per kilobase of transcript per million mapped reads) values for infected and uninfected plants, averaged from three replicates (Ali et al., 2017). After averaging the replicates, the CLI score was calculated from the three orthogonal datasets as described above. Since the Fister et

were included in the multi-species clustering analysis.
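The CLI factor is described in the text as the length of a gene's regulation vector after replicate averaging. A minimal Python sketch of that calculation follows. It assumes log2 induction values, a plain Euclidean norm, and no clipping of negative (downregulated) values; none of these details are confirmed by the text. The function names and the example data are hypothetical.

```python
import math
from statistics import mean

CLI_THRESHOLD = 4.0  # cutoff stated in the text for "strongly induced" genes


def cli_factor(replicates_by_condition):
    """Return the length of a gene's regulation vector.

    replicates_by_condition maps a condition name to a list of replicate
    log-induction values. Replicates are averaged first; each averaged
    condition is then treated as one orthogonal coordinate.
    Assumption: negative values are not clipped before taking the norm.
    """
    averaged = [mean(values) for values in replicates_by_condition.values()]
    return math.sqrt(sum(v * v for v in averaged))


def strongly_induced(genes, threshold=CLI_THRESHOLD):
    """Return {gene: CLI factor} for all genes whose CLI factor exceeds the threshold."""
    scores = {gene: cli_factor(conditions) for gene, conditions in genes.items()}
    return {gene: score for gene, score in scores.items() if score > threshold}


# Hypothetical example: one gene moderately induced in two conditions,
# one gene weakly induced in both.
example = {
    "geneA": {"cond1": [3.0, 3.0], "cond2": [4.0]},
    "geneB": {"cond1": [1.0, 1.0], "cond2": [0.5]},
}
selected = strongly_induced(example)
```

With this definition, a gene can pass the threshold either through one strongly induced condition or through moderate induction in several conditions, which matches the stated purpose of the combined factor.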

Sequence clustering of upregulated genes
The top-ranking gene products from each species (exceeding the species-specific score of 4.0) were combined in a common fasta-formatted protein sequence file and subjected to all-against-all similarity searches using BLAST (Altschul et al., 1990). In addition, each protein was also compared to the proteomes of four outgroup species belonging to different taxa (Selaginella moellendorffii, Chlamydomonas reinhardtii, Saccharomyces cerevisiae and Homo sapiens). All sequences were obtained from the Uniprot database (UniProt, 2021). Upregulated Sequence Clusters (USCs) were defined by joining upregulated gene products from Arabidopsis, Glycine and Theobroma when they fulfil the following criteria: i) cluster members are connected to each other by significant BLAST scores with E-value < 10^-3; and ii) the within-cluster BLAST scores are better (lower E-value) than any match between a cluster member and one of the outgroup sequences.
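The two criteria above can be sketched as an edge filter followed by a grouping step. The Python sketch below is an illustration under stated assumptions, not the pipeline actually used. It assumes that BLAST results have already been parsed into `(protein_a, protein_b, evalue)` tuples, that the best outgroup E-value per protein is available as a dictionary, and that a cluster corresponds to a connected component of the filtered network. All names and example values are hypothetical.

```python
import math
from collections import defaultdict

E_CUTOFF = 1e-3  # significance threshold from criterion i)


def filter_edges(ingroup_hits, best_outgroup_evalue):
    """Keep hits that are significant and better than any outgroup match.

    ingroup_hits: iterable of (protein_a, protein_b, evalue) tuples.
    best_outgroup_evalue: dict protein -> lowest E-value against the
    outgroup proteomes; proteins without an outgroup hit may be omitted.
    """
    edges = {}
    for a, b, evalue in ingroup_hits:
        if a == b or evalue >= E_CUTOFF:
            continue
        outgroup_best = min(best_outgroup_evalue.get(a, math.inf),
                            best_outgroup_evalue.get(b, math.inf))
        if evalue >= outgroup_best:  # criterion ii) not met
            continue
        key = tuple(sorted((a, b)))
        edges[key] = min(evalue, edges.get(key, math.inf))
    return edges


def edge_weight(evalue, floor=1e-300):
    """Negative decimal logarithm of the E-value; the floor (an assumption) avoids log(0)."""
    return -math.log10(max(evalue, floor))


def connected_clusters(edges):
    """Group proteins into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical example data
hits = [("At1", "Gm1", 1e-50), ("Gm1", "Tc1", 1e-30),
        ("At2", "Tc2", 1e-10), ("At3", "Tc3", 1e-2)]
outgroup = {"At2": 1e-20}
kept = filter_edges(hits, outgroup)
usc_candidates = connected_clusters(kept)
```

In this toy input, the At2/Tc2 pair is dropped because At2 has a better match in the outgroup, and the At3/Tc3 pair fails the significance cutoff, leaving a single three-member cluster.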

The resulting list of protein-to-protein connections was exported into a Cytoscape format file (Shannon et al., 2003), using the negative decimal logarithm of the BLAST E-value as the connection weight for the links between gene products. The resulting Cytoscape file is provided as Supplementary file S1.

Cluster annotation and sequence analysis
table whenever 75% of the USC members had concordant localization predictions. If no functional information was available for any of the cluster members, the protein sequences were subjected to a bioinformatical analysis using BLAST (Altschul et al

Figure 1
Network display of upregulated sequence clusters (USCs). The network shows the connected fraction of the upregulated gene list, including 930 out of 1766 Arabidopsis thaliana genes (red dots), 906 out of 1019 Glycine max genes (green dots) and 1120 out of 1451 Theobroma cacao genes (blue dots). Lines represent significant BLAST hits; line thickness scales with BLAST significance (-log p-value). Unconnected dots are not shown. A full version of this figure, including gene names and annotations, is provided as Supplementary file S1.

Figure 2
Upregulation of the PREPIP family of putative peptide precursor genes. A: Logarithmic induction factors of individual genes in USC #126 (PREPIP2/3 family, orange/brown color) and USC #266 (PREPIP1 family, blue color) for the different experimental conditions. For Arabidopsis data, different time points after inoculation are shown. For Cocoa data, different pathogens (Colletotrichum theobromicola, Phytophthora palmivora, Phytophthora megakarya) are shown. For Glycine data, the averaged response of susceptible and resistant strains is shown. B: Sequence display of USC #126 members, revealing their repeat structure. The N-terminal signal sequence is shown in red, the C-terminal repeat is highlighted on black and grey background for residues invariant or conservatively replaced in 50% of all sequences. C: Comparison of the C-terminal peptide repeat in the PIP2/3 family (top), PIP1 family (middle) and PCEP family (bottom). Sequence conservation is shown as WebLogo (Crooks et al., 2004), wherein the size of the letters scales with the conservation of the respective amino acids.

Figure 3
Plant relatives of the NDUFA4 family. A: Logarithmic induction factors of individual genes of USC #87 under the different experimental conditions. For Arabidopsis data, different time points after inoculation are shown. For Cocoa data, different pathogens (C. theobromicola, P. palmivora, P. megakarya) are shown. For Glycine data, the averaged response of susceptible and resistant strains is shown. B: Sequence conservation in the extended NDUFA4 family. The first three sequences are from humans, the others from plants. Sequence order corresponds to the dendrogram in panel C. Residues that are invariant or conservatively replaced in at least half of the sequences are shown on black or grey background, respectively. C: Dendrogram analysis of the NDUFA4 family shows independent gene duplication events in the mammalian and plant lineages. The expression trend of the different clades is shown at the right-hand side.

Supplementary Files
This is a list of supplementary files associated with this preprint.
S1USCnetwork.cys
S2allUSCs.xlsx
S3fullPRtable.xlsx
S4preprocessedsourcedata.xlsx