Identication of Essential Protein Domains From High-density Transposon Insertion Sequencing

15 A first clue to gene function can be obtained by examining whether a gene is required 16 for life in certain standard conditions, that is, whether a gene is essential. In bacteria, essential 17 genes are usually identified by high-density transposon mutagenesis followed by sequencing 18 of insertion sites (Tn- seq). These studies assign the term “essential” to whole genes rather than 19 the protein domain sequences that confer the essential functions. However, genes can code for 20 multiple protein domains that evolve their functions independently. Therefore, when essential 21 genes code for more than one protein domain, only one of them could be essential. In this study, 22 we defined this subset of genes as “ e ssential d omain- c ontaining” (EDC) genes . Using a Tn-seq 23 data set built-in Burkholderia cenocepacia K56-2, we developed an in silico pipeline to identify 24 EDC genes and the essential protein domains they encode. We found forty candidate EDC 25 genes and demonstrated growth defect phenotypes using CRISPR interference (CRISPRi). 26 This analysis included two knockdowns of genes encoding the protein domains of unknown 27 function DUF2213 and DUF4148. These essential domains are conserved in more than two 28 hundred bacterial species, including human and plant pathogens. Together, our study suggests 29 that essentiality should be assigned to individual protein domains rather than genes, 30 contributing to a first functional characterization of protein domains of unknown function.

we defined this subset of genes as "essential domain-containing" (EDC) genes. Using a Tn-seq 23 data set built-in Burkholderia cenocepacia K56-2, we developed an in silico pipeline to identify 24 EDC genes and the essential protein domains they encode. We found forty candidate EDC 25 genes and demonstrated growth defect phenotypes using CRISPR interference (CRISPRi). 26 This analysis included two knockdowns of genes encoding the protein domains of unknown 27 function DUF2213 and DUF4148. These essential domains are conserved in more than two 28 hundred bacterial species, including human and plant pathogens. Together, our study suggests 29 that essentiality should be assigned to individual protein domains rather than genes, 30 contributing to a first functional characterization of protein domains of unknown function. 31

Introduction 42
A first step when characterizing gene function should be asking whether a given gene 43 encodes an essential cellular function, whether the gene is necessary for the survival of the 44 organism. A widely accepted method to identify essential genes in bacteria is high-density 45 transposon mutagenesis, followed by Illumina-sequencing of the transposon insertion junctions 46 (Tn-seq) 1  gene length and used to predict essentiality. 5-15% sequences from the 3' and 5' ends are 52 usually removed from the analysis, as insertions within the terminal regions are likely non-53 disruptive 2-5 . While disrupted genes are regarded as "non-essential," the method yields a list 54 of putative essential genes as those with zero or very few mapped reads (Figure 1a and b) 3 . 55 Another step towards identifying gene function is the annotation of the protein domains 56 encoded by genes. Protein domains are functional or structural units that can fold, evolve, and 57 function independently. Homology-based protein domain prediction and function assignment 58 are effective starting points for understanding protein function, even when diverse protein 59 architectures add complexity to functional annotations 6,7 . While domain databases such as 60 Pfam 8 and InterPro 9 aim to provide maximum sequence coverage to predict protein domain 61 identity, approximately 30% of all domains listed in these databases (Pfam 33.1 and InterPro 62 81.0) are 'domains of unknown function (DUFs).' Single DUFs are usually predicted to span 63 through functionally uncharacterized proteins. However, studies suggest that at least some of 64 these proteins may contain more than one domain 10,11 . 65 While robust and comprehensive, Tn-seq studies do not consider that genes may encode 66 for more than one protein domain. Tn-seq analysis may classify a gene as "non-essential" due 67 to the presence of transposon insertions in a non-essential coding region, despite the gene 68 coding for a second domain not spanning through the whole gene length that might be essential 69 3,12,13 . We operationally defined this subclass of essential genes as "essential domain-70 containing" (EDC) genes (Figure 1c and d) and set out to identify them in a Tn-seq dataset 71 built-in Burkholderia cenocepacia K56-2 14 . By analyzing biases in transposon density in 72 genes previously identified as "non-essential", we found 40 genes where the encoded proteins 73 contained putative essential and non-essential domains. Using a CRISPR Interference 74 (CRISPRi) 15 platform we developed for Burkholderia 16 , we experimentally confirmed growth 75 defects, representing the loss of an essential function, in 27 EDC gene knockdowns. The 76 identified EDC genes include ten encoding known multidomain proteins and two entirely 77 uncharacterized genes encoding different N-terminal DUFs, demonstrating the utility of the 78 approach. This study highlights that gene essentiality depends on the function of individual 79 protein domains rather than entire proteins. 80

Identification of EDC Genes from Tn-seq Data 82
To identify EDC genes in B. cenocepacia K56-2, we built a custom script that used our 83 previous Tn-seq data 14 to select genes that i) were not previously found to be essential in B. 84 cenocepacia K56-2 14 , and ii) had an asymmetric distribution of transposon insertions ( Figure  85 2). The script split each gene into two equal parts and selected genes with reads in only one 86 region to identify genes with transposon insertion biases. We worked under the assumption that 87 (i) each half could represent one functional domain and (ii) one of the domains may be essential 88 while the other may not. We arbitrarily set the parameters "min ratio" and "min reads" to 0 and 89 0.14, respectively (see Material and Methods and Supplementary Figure 1). These settings 90 looked for genes that had zero reads at one end, while the number of reads in the non-empty 91 end was at least 14% of that region's length. For example, if a section of a gene was 100 bp in 92 length, it would require at least 14 reads mapped to that section to be considered non-essential. 93 With these settings, the script produced an extensive list of 178 candidate EDC genes 94 Table 1). 95

Bioinformatic Analysis of the Candidate EDC Genes 96
We reasoned that if EDC genes contained essential protein domains, then the essential 97 protein domains may be encoded by essential genes in at least some other bacteria. We then 98 searched for essential ortholog genes of the 178 candidate EDC genes by BLASTx searches 99 against the 'Database of Essential Genes (DEG) 17 using 50% sequence alignment and 30% 100 sequence identity as the cut-off. We found that 40 of the 178 genes had orthologs annotated as 101 'essential" in other bacterial species. We wished to interrogate the domains encoded by these 102 40 genes using UniProt 18 based on InterPro domains 9 . InterPro predicts the domain 103 information by matching the protein or nucleic acid sequences against the member databases 104 (collectively known as InterPro consortium) to identify 'signatures' associated with known 105 domains. Thus, the InterPro prediction relies on the availability of sequence characterization 106 and annotation. This analysis showed that from the 40 candidate EDC genes predicted to be 107 essential by homology with other essential genes, 10 genes encoded multidomain proteins, and 108 7 of them were well-characterized, such as the N-terminal domain of DnaK and NusA ( Figure  109 3a). The remaining genes were predicted to have one single annotated domain (19 genes) that 110 did not span the whole gene-length or encoded uncharacterized proteins (11 genes) 111 (Supplementary Table 2). All 40 genes had transposon insertions located in one half of the 112 gene, showing that the script was able to identify genes with biased transposon insertions 113 Figure 2). Taken together, these results suggest that the identified genes 114 could be essential due to the presence of essential protein domains orthologues. Notably, 17 115 DNA regions were identified as coding for new putative essential protein domains ( Table 1). 116 118

CRISPRi Knockdowns of EDC Genes Show Growth Defects 119
To phenotypically characterize the effect of knocking down EDC genes, we used 120 CRISPR interference or CRISPRi 16 to create knockdown mutants of the genes of interest. 121 CRISPRi comprises a chromosomally integrated dCas9 under the control of a rhamnose-122 inducible promoter and plasmid-borne sgRNA driven by a constitutively active synthetic 123 promoter, PJ23119 16 . Simultaneous expression of dCas9 and a target-specific sgRNA allows the 124 dCas9 to bind the target DNA region and, thus, sterically interfere with transcription by RNA 125 polymerase 15,16 . To inhibit the expression of the candidate genes, we designed two sgRNAs 126 against each of the candidate genes targeting the start codon and adjacent region on the non-127 template strand (Supplementary Figure 3a and c). For phenotypic characterization, we grew 128 the cells in LB with and without rhamnose. Upon induction of dCas9 with rhamnose, 27 out of 129 the 40 candidate genes showed at least 25% growth inhibition relative to the uninduced 130 condition (Supplementary Figure 3b-d). 131

DUF2213 and DUF4148 Appear to be Essential Domains 132
The presence of DUFs is a common feature of hypothetical or uncharacterized proteins. 133 To initiate functional characterization of DUFs, we focused on two genes containing DUF-   (Figure 4b and Supplementary Table 5). DUF2213 is also 150 present in many phage-related proteins (Figure 4a). Eight unique domain architectures were 151 observed for proteins containing DUF2213 and five for DUF4148 (Figure 4c-d). DUF2213 is 152 associated with another essential domain PF00293, a NUDIX hydrolase (Figure 4c). In other 153 proteins, DUF2213 is associated with the LPD3 domain (PF18798) and DUF1073 (PF06381) 154 which is also conserved across bacterial species 11 (Figure 4c). On the other hand, Pfam 155 analysis of DUF4148 shows that DUF4148 differs in domain length among species and is 156 associated with the Pfam domain PF00144, known to confer resistance against β-lactams 157 (Figure 4d) 19 . Nonetheless, the N-terminus was highly conserved, suggesting it is functionally 158 significant. The Pfam-based analysis of species distribution also revealed that DUF2213 is 159 present in six eukaryotic species (five metazoans and one fungal species), whereas DUF4148 160 is present in five eukaryotic species (three viridiplantae species and two metazoan species). 161 The widespread distribution of these DUFs indicates the functional importance of these 162 essential domains, creating an impetus for further characterization. That is the case of essential genes identified by Tn-seq 1 . In standard Tn-seq analysis the 169 condition of essentiality is assigned to genes and not to domains, resulting in incorrect 170 classification of many essential genes as non-essential. Rather, essentiality assignment pipeline 171 should be revised to analyze the essentiality of individual protein domains 21 . Indeed, 172 essentiality can be assigned to individual domains of a multidomain protein rather than the 173 entire protein 12,13 . In this work, we defined as essential-domain-containing (EDC) genes those 174 genes that encode more than one protein domain, with one of the domains coding for an 175 essential function. By analyzing a Tn-seq dataset 14 for transposon insertion biases, we show 176 that standard Tn-seq analysis pipelines may miss EDC genes, whose detection often requires 177 either manual curation or additional considerations 22 . 178 We validated our approach by identifying previously characterized multidomain 179 essential proteins in which the essential function is assigned to one single domain. For instance, 180 our analysis of biases in the Tn-seq dataset showed that the N-terminal domain of NusA 23 is 181 sufficient to mediate the essential function, in agreement with previous work 24 . Similarly, the 182 B. cenocepacia K56-2 dnaK gene was previously defined as non-essential 14 ; however, we 183 found that the Tn-seq reads mapped onto dnaK were biased toward the C-terminal domain 184 (CTD), suggesting that only the NTD is necessary for its essential function. (Figure 3b, CRISPRi is more effective in blocking transcription initiation than elongation, and is the most 202 efficient in silencing gene expression when promoter regions are targeted with gRNAs 15,28-30 . 203 However, as promoter regions for B. cenocepacia genomes remained largely unannotated we 204 targeted translation start sites. It remains to be investigated whether targeting the promoter 205 region to block the transcription initiation rather than elongation might yield conditional a 206 growth phenotype in the remaining 13 genes. 207 A large portion of the protein domains that lack functional assignment can be grouped 208 within the DUF category. DUFs are members of ever-increasing uncharacterized protein 209 families; they are the object of experimental and computational efforts towards their functional 210 characterization 10,31-33 . Determining if a DUF is essential is among the first steps in functional 211 characterization. In this study, we focused on two EDC genes that encode putative essential 212 DUFs: DUF2213 and DUF4148. Both domains have a high degree of conservation across 213 diverse phyla, which highlights their biological relevance. DUF2213, a phage-associated 214 domain (PF09979), is well distributed across bacteria and phages. Interestingly, we found that 215 DUF4148 (PF13663) is putatively essential and associated with β-lactamase (PF00144) 216 (Figure 4). 217 In summary, our study identified 27 EDC genes whose knockdown produced a growth 218 defect, highlighting the essential nature of one of their protein domains. By leveraging a Tn-219 Seq dataset in B. cenocepacia K56-2 14 , we demonstrate that the essential nature of protein-220 coding genes is a function of the individual protein domains they encode. We propose that 221 determining essentiality of a domain of unknown function should be the first step in the process 222 to define their function. 223

Bacterial Strains and Growth Conditions 225
The list of bacterial strains and plasmids used in this study is provided in 226 Supplementary Table 3. Bacterial strains were grown in LB-Lennox medium (Difco) at 37 0 C. 227 40µg/mL (Fisher Scientific). Donor strains of E. coli DH5α and B. cenocepacia K56-2 carrying 229 the sgRNA plasmids were selected in trimethoprim 50µg/mL and 100µg/mL (Sigma), 230 respectively. 231

Identification of EDC Genes from Tn-Seq Dataset 232
Candidate EDC genes were identified with a custom python script using the Tn-seq 233 dataset 14 . The script analyzed every gene previously classified as "non-essential" by splitting 234 it into two equal halves and counting the number of reads mapped to each half-gene. The script 235 then used the "min ratio" and "min reads" as filtering criteria to call EDC genes. "Min ratio" 236 was defined as the desired ratio of reads between the halves of the gene. "Min reads" was 237 defined as the minimum number of reads in the non-empty end that is equal to a 14% of that 238 half's length. Min reads was set to 0.14, while min ratio was set as 0. For each gene, 10% from 239 each end of the gene was discarded from the analysis. The parameters can be changed to yield 240 either more stringent or more general results. The script is available at 241 https://github.com/cardonalab/EssentialDomains 242

Bioinformatic Analysis 243
Orthologous essential genes were identified using BLASTx against DEG 15 17 .

Creating Knockdown Mutants of the Candidate EDC Genes with CRISPRi 254
CRISPRi mutants were of the EDC genes were created as previously described 16 . 255 Briefly, pSCB2-sgRNAv2, a modified plasmid from pSCB2-sgRNA 16 , was used as the 256 template for inverse PCR to insert 20bp target-specific sgRNA sequence. Inverse PCR was 257 performed using Q5 high-fidelity polymerase (NEB), forward primers with individual sgRNAs 258 as 5' tail, and 1092 as the reverse primer. The resultant fragments were ligated to create circular 259 plasmids by incubating 0.5µL of the respective PCR products with quick ligation buffer (NEB),

Conditional Growth Phenotype Analysis of the CRISPRi Mutants 271
To determine the conditional growth phenotype of the candidate genes, overnight 272 cultures of the CRISPRi mutants were back diluted to OD600nm 0.01. The cultures were grown 273 at 37 0 C for 20-24 hours with continuous shaking in a 384-well plate containing LB broth 274 supplemented with trimethoprim 100µg/mL and with/without 1% rhamnose. OD600nm readings 275 were taken at 1 hour intervals using BioTek Synergy 2 microplate reader. While disrupted genes are regarded as "non-essential," the method yields a list of putative essential genes as those with zero or very few mapped reads (1a and 1b) We operationally de ned this subclass of essential genes as "essential domain-containing" (EDC) genes (1c and 1d)   DUF2213 is also present in many phage-related proteins (4a). DUF4148 is found in 204 bacterial species, primarily in Burkholderia species(4b) Eight unique domain architectures were observed for proteins containing DUF2213 and ve for DUF4148 (4c-d).