Potential aggregation hot spots in recombinant human keratinocyte growth factor: a computational study

Abstract The recombinant human keratinocyte growth factor (rhKGF) is a highly aggregation-prone therapeutic protein. The high aggregation liability of rhKGF is manifested by loss of the monomeric state, and accumulation of the aggregated species even at moderate temperatures. Here, we analyzed the rhKGF for its vulnerability toward aggregation by detection of aggregation-prone regions (APRs) using several sequence-based computational tools including TANGO, ZipperDB, AGGRESCAN, Zyggregator, Camsol, PASTA, SALSA, WALTZ, SODA, Amylpred, AMYPDB, and structure-based tools including SolubiS, CamSol structurally corrected, Aggrescan3D and spatial aggregation propensity (SAP) algorithm. The sequence-based prediction of APRs in rhKGF indicated that they are mainly located at positions 10–30, 40–60, 61–66, 88–120, and 130–140. Mapping on the rhKGF structure revealed that most of these residues including F16-R25, I43, E45, R47-I56, F61, Y62, N66, L88-E91, E108-F110, A112, N114, T131, and H133-T140 are surface-exposed in the native state which can promote aggregation without major unfolding event, or the conformational change may occur in the oligomers. The other regions are buried in the native state and their contribution to non-native aggregation is mediated by a preceding unfolding event. The structure-based prediction of APRs using the SAP tool limited the number of identified APRs to the dynamically-exposed hydrophobic residues including V12, A50, V51, L88, I89, L90, I118, L135, and I139 mediating the native-state aggregation. Our analysis of APRs in rhKGF identified the regions determining the intrinsic aggregation propensity of the rhKGF which are the candidate positions for engineering the rhKGF to reduce its aggregation tendency. Communicated by Ramaswamy H. Sarma


Background
Keratinocyte growth factor (KGF), also known as FGF-7, is a member of the fibroblast growth factor (FGF) family with the proven therapeutic properties in wound healing (Dastjerdeh et al., 2019;Rajan et al., 2010). As a potent paracrine-acting epithelial-specific mitogen, KGF acts exclusively through the FGFR2-IIIb spliced variant of FGFR2 expressed by the epithelial cells and is able to protect these cells from various injuries like chemotherapeutic agents used in cancer therapy (Huang, 2012). A truncated form of KGF known as Palifermin, hereafter referred to as rhKGF is used as the only FDA-approved medication for the treatment of chemotherapy-and radiotherapyinduced oral mucositis (Dastjerdeh et al., 2019). It has shown the ability to reduce the severity, incidence, and duration of oral mucositis in cancer patients (Hsu et al., 2006). However, KGF is a fairly unstable and highly aggregation-prone therapeutic protein denaturing at $37 C (around the body temperature) which is the lower end of temperature stability range for most of the globular proteins (Huang, 2012). Low conformational stability of rhKGF is attributed to the repulsion of the positively charged residues in the clusters forming the heparin-binding sites. Following the unfolding event, the high intrinsic aggregation liability of rhKGF which is attributed to the presence of aggregation-promoting hotspots leads to the formation of irreversible aggregation products including particles and precipitates (Hsu et al., 2006).
According to the Lumry and Eyring equation, aggregate formation can be suppressed either through stabilizing the native state or reducing aggregation rate (Rajan et al., 2010). Various stabilizers including heparin, sulfated polysaccharides like dextran sulfate, anionic polymers, osmolytes like N, N 0 -dimethylglycine, trehalose, and sucrose, and high concentrations of salts including sodium phosphate, ammonium sulfate, sodium citrate, and sodium chloride were successfully used to reduce rhKGF aggregation tendency via increasing its denaturation temperature and conformational stability. The stabilizing effect of these factors on the rhKGF may be due to the neutralization of the positive charges (Chen et al., 1994). The solid-phase PEGylating of rhKGF also significantly improved the stability without affecting its structure (Huang, 2012). One of the major studies on KGF stability was performed by Amgen Company to generate more stable, yet active analogs of KGF. The most stable analog representing the optimal balance between the stability and activity was the truncated form of KGF lacking the 23 N-terminal amino acids marketed as Palifermin (Hsu et al., 2006). However, the high aggregation tendency, low storage stability and short plasma half-life of rhKGF has remained a challenge limiting its pharmaceutical applications (Poorebrahim et al., 2017;Zahra et al., 2016). Protein aggregation can also reduce the protein binding affinity, elicit immune responses, and hamper the safety and efficacy of rhKGF as other biotherapeutics (Buck, 2012;Wang & Roberts, 2010). As such, predicting and mitigating the aggregation of rhKGF is of great importance from a pharmaceutical perspective. Understanding of the aggregation mechanisms obtained via computational and experimental discoveries has led to the creation of robust algorithms for prediction of "aggregation-prone regions" (APRs), i.e. the sequence and structural features of proteins facilitating protein aggregation (Lee et al., 2013). To the best of our knowledge, there is no previous study addressing the intrinsic aggregation propensity of rhKGF.
Our results identified the candidate positions for the rational design of rhKGF to reduce its vulnerability toward aggregation while maintaining (or even increasing) its potency and specificity. Developing the less aggregationprone variants of rhKGF might reduce the need for stabilizers used in the drug formulation, allow preparation of the drug in the ready-to-use prefilled syringes, increase the drug shelflife or storage stability, and increase the in vivo half-life which can result in the reduction of the required therapeutic doses, adverse effects and treatment costs. This approach might also be applicable to reduce aggregation propensity of the other aggregation-prone therapeutic proteins. Our findings of the aggregation hot spots in rhKGF may be useful for the other FGF family members.

Methods
The rhKGF sequence and structure were analyzed using most of the available APR detection tools applying different algorithms to prevent biases from the training sets, parameterization, and the specific characteristics of any given method. A schematic diagram of the study was presented in Figure 1.

Retrieval of the rhKGF sequence
To identify the aggregation hotspots in the rhKGF using the sequence-based APR prediction tools, its amino acid sequence was retrieved from the DrugBank database (the accession number DB00039) and the Uniprot ID of P21781-1 except for deletion of 23 N-terminal residues .

Homology modeling of rhKGF
Since the three-dimensional structure of rhKGF has not yet been determined by experimental techniques, homology modeling (template-based modeling) was performed using GalaxyTBM server to predict its 3 D-structure (Ko et al., 2012). The server used 1QQL_A as the template which belongs to the Rattus Norvegicus FGF7 crystal structure. The quality of the regenerated models was analyzed using MolProbity structure-validation server (http://molprobity.biochem.duke. edu/) based on the protein geometry properties including the percentage of poor and favored rotamers, Ramachandran favored and outliers, Cb deviations, bad bonds, bad angles and cis-prolines. The best model was refined using GalaxyRefine2 web server. All-atom contact analysis besides covalent-geometry and torsion-angle criteria was analyzed using MolProbity for all of them. The less MolProbity score, the highest quality model is achieved. The model validation was also performed using PROCHECK and Verify3D (https:// servicesn.mbi.ucla.edu).

Sequence-based detection of APRs in rhKGF
The amino acid sequence of the rhKGF was analyzed using different sequence-based APR detection tools whose basis of prediction and methodologies are described in the following sections.

TANGO
TANGO applies a statistical mechanics based algorithm for the prediction of protein aggregation nucleating regions as well as the effect of mutations and environmental conditions on the aggregation propensity of these regions (Buck, 2012). TANGO calculates the cross-beta aggregation in peptides and denatured proteins and considers the propensity of different competing structural states including b-turn, alphahelix, and b-sheet (Buck, 2012). For the prediction of each conformation propensity, a partition function is used and cross-beta aggregation is predicted based on the assumption that the core regions involved in the aggregation process are fully buried and satisfy their hydrogen-bond potential. TANGO considers the protein as several overlapping peptides and calculates their aggregation tendency. The environmental conditions like stability, pH, ionic strength, protein concentration and concentration of denaturant TFE are also considered in the TANGO aggregation score. To analyze the rhKGF sequence with TANGO, a text file containing rhKGF sequence as twenty six overlapping peptides (known as the Sup) with the length of ten residues, each one containing five common residues with the next sup was prepared with the default pH, temperature and ionic strength conditions. For the data interpretation, any residue with an aggregation score above 5% over 5-6 residue is a potential aggregationprone region (APR).

AGGRESCAN
AGGRESCAN is a server for the identification of aggregation nucleating segments or "aggregation hot spots" in the polypeptides based on an aggregation-propensity scale obtained from the experimentally validated hotspots in a group of both natively unfolded and pathogenic proteins like Ab42, synuclein, prion, amylin and etc. AGGRESCAN is also able to detect the mutation effects on protein aggregation tendency as a function of protein sequence composition. For each protein sequence, AGGRESCAN calculates and presents: 1. An aggregation-propensity value for each amino acid (a 3 v), a 3 v windows average (a 4 v) and a graphical view of the protein aggregation profile (AP) 2. The area of the aggregation profile (Hotspot Area: HSA) above the hotspot threshold (HST) in a given "hot spot" and the graphical representation of the peak area 3. Putative aggregation "hot spots" which are defined as the regions with five or more residues with an a 4 v larger than the HST. 4. HSA divided by the number of residues in the entry sequence, known as "Normalized Hotspot Area" (NHSA) per residue (Conchillo-Sol e et al., 2007) To make APR prediction using AGGRESCAN, the rhKGF sequence was submitted to the AGGRESCAN database available at http://bioinf.uab.es/aggrescan/ and all of the above parameters were calculated.

ZipperDB
ZipperDB predicts protein aggregation by evaluation of fibrilforming propensity of the protein segments using the 3 D-profile method (Kuhlman & Baker, 2000). ZipperDB contains segments with high fibril-forming propensity obtained from the analysis of more than 20,000 protein sequences that can form the "steric zipper" structure consist of two self-complementary beta-sheets creating the spine of an amyloid fibril (Nelson et al., 2005). Each proline-free hexapeptide from a protein sequence is threaded onto the NNQQNY structure and its energetic fit is calculated by RosettaDesign program.
To avoid the problems with disulfide bonds, the Cysteine residues are replaced with Serine during modeling. Based on the experimental amyloid-like fibrils, the energy-threshold of À23 kcal/mol was determined. The sequences with the energy level equal to or below the À23 kcal/mol are defined as segments with high fibril-forming propensity (Sawaya et al., 2007). To analyze the rhKGF (Palifermin) with the ZipperDB, its sequence was submitted to ZipperDB web interface (https://services.mbi.ucla.edu/zipperdb/intro) and a graphical representation of its fibril-formation propensity besides the following data were provided for each peptide segment (The preferences presented below are related to the fibril-forming tendency): 1. Rosetta energy of one layer composed of two betastrands (lower energies are preferred) 2. Shape complementarity of the steric zipper interface (range 0-1, higher values are preferred) 3. Area of Interface: The solvent-accessible surface area (SASA) at the steric zipper interface per layer (higher values are preferred) 4. The contact area between two sheets (the larger area is preferred) 5. The SASA of the five-layer structure (lower values are preferred) 6. The composite score obtained from the combination of Rosetta energy, shape complementarity and area of the interface (lower values are preferred)

Zyggregator
Zyggregator is also a sequence-based method which calculates different propensities including the local stability of the monomeric state (ln Pi), formation of b-rich oligomers (Z i tox ), and the formation of fibrillar aggregates (Z i agg ) to predict the intrinsic amyloid aggregation propensity of the protein. The Z i agg score 0 is related to the aggregation propensity equal Figure 1. A schematic diagram describing the method applied for prediction of aggregation hotspots in rhKGF using different APR detection tools applying different algorithms.
to that of a random sequence at position i, and when it is one standard deviation more aggregation-prone the score is 1. The Z i agg score 1 or more is related to the aggregationprone residues (Tartaglia & Vendruscolo, 2008).
To analyze the rhKGF with zyggregator, its sequence was submitted to the zyggregator database available at http:// www-mvsoftware.ch.cam.ac.uk/index.php/zyggregator and the above parameters were calculated for it. Then a graphical representation based on the residues Z i agg score was prepared.

CamSol
CamSol consists of both a sequence-, and a structure-based algorithm for the calculation of protein solubility. The first algorithm, known as CamSolintrinsic, is a sequence-based method for the prediction of intrinsic solubility of the protein unfolded state. The rhKGF sequence was used in CamSolintrinsic available at (http://www-vendruscolo.ch.cam. ac.uk/camsolmethod.html) to analyze its intrinsic solubility.

WALTZ
WALTZ is a computer algorithm for the amyloid prediction in proteins which is capable of distinguishing the true amyloids from the amorphous aggregates. WALTZ is also able to identify hard-to-predict amyloid aggregates like those involved in prion disease (Oliveberg, 2010) . WALTZ has a three-component scoring function: a position-specific scoring matrix (PSSM) for hexapeptides to identify amyloid-forming regions using the AmylHex dataset which is supplemented with a large number of experimentally determined amyloidogenic and non-amyloidogenic sequences, the physicochemical properties of amino acids including hydrophobicity and b-structure forming propensity, and a position-specific pseudoenergy matrix determined from the crystal structure of Sup35 GNNQQNY peptide which forms the cross-b spine of amyloid fibrills to estimate the relative energies using FoldX (Ahmed & Kajava, 2013;Meric et al., 2017). WALTZ algorithm have been trained by a large set of experimentally validated amyloid-forming peptides with eighty percent of the data provided has been validated with technologies like circular dichroism, electron microscopy and infrared spectroscopy, Fourier-transform infrared spectroscopy (FTIR) and Thioflavin-T binding assays (Meric et al., 2017).
The amyloid-forming peptides can be predicted with each of the three existing thresholds in WALTZ database including "Best overall performance", "High specificity", and "High sensitivity" or with the customized threshold defined by the user. To analyze the rhKGF using WALTZ, its amino acid sequence in FASTA format was submitted to the online WALTZ web server (http://waltz.switchlab.org/) and APR prediction was made with different thresholds.

PASTA
PASTA is a sequence-based server to identify the propensity of different protein sequences for amyloid aggregation, the effect of mutations on the aggregation rate, the aggregation hotspots establishing the pathological conditions in amyloidosis, and the portions stabilizing the cross-beta core of fibrillar aggregates. PASTA also provides information about the protein intrinsic disorder and secondary structure which complement the aggregation prediction (Walsh et al., 2014). PASTA makes aggregation predictions based on the assumption that mechanisms governing the native b-sheet formation, are also responsible for the formation of b-sheets in amyloid aggregates (Meric et al., 2017). The peptides with the energy level below a specified threshold are defined as aggregation hotspots. To analyze the rhKGF, its amino acid sequence was submitted to PASTA server available at http:// protein.bio.unipd.it/pasta2/.

SODA
SODA can estimate the protein solubility changes based on the protein physico-chemical properties like sequence-based aggregation propensity, intrinsic disorder, plus hydrophobicity and the secondary structure preferences. SODA applies PASTA aggregation propensity, ESpritz intrinsic disorder scores, Kyte and Doolittle hydrophobicity profile and FESS secondary structure (a-helix and b-strand) propensity to predict changes of protein solubility based on the physicochemical properties (Paladin et al., 2017). Furthermore, SODA can predict the mutations increasing or decreasing the solubility of the protein at all of the positions and also the effect of mutations created by the user on the protein solubility (Paladin et al., 2017). The SODA input should be in the FASTA or PDB format and the output is the graphical solubility profile of the wild type protein and the effects of variations on protein solubility. The server is available at http:// protein.bio.unipd.it/soda/. The SODA server was used to determine the low solubility regions of the rhKGF.

Amylpred
Amylpred utilizes a consensus of various strategies that have been found or explicitly developed to foresee the features related to the formation of amyloid fibrils. Amylpred has been useful to identify the amyloid-forming regions involved in the development of conformational diseases known as amyloidosis like Alzheimer's, Parkinson's, type II diabetes, and prion disease. Amylpred also enables us to identify the properties of protein folding/misfolding and to control the aggregation/solubility of biopharmaceuticals in biotechnology industry . AMYLPRED makes a consensus prediction of five methods including Average Packing Density, Possible Conformational Switches, Amyloidogenic Pattern, TANGO, and Hexapeptide Conformational Energy and is available at http://aias.biol.uoa.gr/AMYLPRED/. In the Amylpred, the hits are defined as the overlap of at least two out of five methods . To make a consensus prediction of APRs in rhKGF, the AmylPred server was used which makes 2.3.10. AMYPDB AMYPDB as a comprehensive amyloid protein database provides a pattern discovery and analysis tool enabling us to detect regions of significance in amyloid formation in any protein query. AMYPDB has been dedicated to the accumulation of the sequence and structure of the amyloidogenic protein families associated with several pathologic conditions such as Alzheimer's disease, prion disease, type II diabetes mellitus, Parkinson, Huntington, and Creutzfeldt-Jakob (Pawlicki et al., 2008).
AMYPDB is also able to make a consensus prediction of APRs using different APR detection tools including Aggrescan, PASTA, SALSA, Pafig, Fold-amyloid, and TANGO (Pawlicki et al., 2008). AMYPDB server, available at http:// amypdb.genouest.org/, was also used to make a consensus prediction of APRs in rhKGF.

SolubiS
SolubiS is a combined method for identification of APRs with high aggregation propensity and low thermodynamic stability using 3-D structure of the protein. It uses TANGO algorithm and FoldX empirical force field for prediction of b-aggregation-prone regions and protein stability, respectively. SolubiS makes distinction between the APRs that are thermodynamically protected from triggering aggregation by folding (structural APRs) and those occurring in the aggregation-competent conformations that form without major unfolding event (critical APRs). The latter can trigger protein aggregation under near-native conditions (van der Kant, 2019; van der Kant et al., 2017). The 3-D structure of rhKGF was submitted to SolubiS web-based tool available at http:// solubis.switchlab.org/ and stretch-plots representing the APRs (TANGO summed score) and their contribution to thermodynamic stability (FoldX summed DG of each APR), besides mutant aggregation and stability spectrum (MASS) plot showing the changes in aggregation propensity and thermodynamic stability upon mutation of APRs to different gate keepers were created.

CamSol structurally corrected
The second algorithm of CamSol, known as CamSol structurally corrected, applies a PDB file as input, provides structural correction to the intrinsic solubility, and calculates the protein solubility based on the amino acid proximity in the structure and regarding their solvent exposure (Sormanni et al., 2015). The rhKGF PDB prepared by homology modeling was used as the input file for CamSol structural analysis (http://www-vendruscolo.ch.cam.ac.uk/camsolmethod.html).

AGGRESCAN3D (A3D)
AGGRESCAN3D is a structure-based APR prediction approach developed to detect surface-exposed spatially-adjacent APRs involved in the aggregation of natively folded proteins in two static and dynamic states based on both amino acids position, and structure. In the A3D static mode calculation, the input structure is minimized with FoldX force-field, and then the AGGRESCAN intrinsic aggregation propensity score of each amino acid is modified based on its structural context. This procedure provides a structurally corrected aggregation score for each residue. In the dynamic mode of A3D, the input structure is first minimized by FoldX and applied as the starting structure for simulation of structural flexibility using CABS-flex protocol which is a fast and effective alternative to classic MD simulation. Then, the trajectory is being processed to generate a set of protein models reflecting the most dominant structural fluctuations. Finally, the A3D static mode calculation is performed for all CABS-flex generated models, and the model with the highest A3D score is presented in output as the most aggregation-prone structural variant of protein (Zambrano et al., 2015). AGGRESCAN3D server, available at http://biocomp.chem.uw.edu.pl/A3D/, was used to identify APRs in the rhKGF structure at the radius of 5 Å.

Spatial aggregation propensity (SAP)
The hydrophobic residues generally participate in the formation of the hydrophobic core of the proteins, increasing the protein stability but they will form aggregation hotspots when they cluster together on the protein surfaces. The spatial aggregation propensity algorithm, briefly known as "SAP", predicts protein structural regions which can form aggregates by making hydrophobic patches on the protein surface. Utilizing molecular dynamics (MD) simulation, SAP can also simulate the size change of the patches under physiological conditions. SAP identifies dynamically exposed hydrophobicity of the certain patches on the surface of the protein.
The SAP calculation procedure yields SAP scores for each atom, average SAP score for each residue and a SAP_mapped PDB file in which the residues are color-scaled based on the measured SAP values. A SAP value is calculated and assigned to each residue based on the sum of the effective hydrophobicity (Ueff) of the neighboring residues SAP atom i ¼ X Simulation average X Residues with at least one side chain atom within R from atom i Â SAA of side chain atoms within radius R SAA of side chain atoms of fully exposed residue ÂHydrophobicity i with a radius R. The effective hydrophobicity of each residue depends on both its intrinsic hydrophobicity and surface exposure (solvent accessible area; SAA). The aggregation propensity of each residue is calculated from the SAP values for each atom, SAP atom (Buck, 2012;Lee et al., 2013): The positive SAP scores indicate the aggregation-prone regions. Whereas measuring the SAP scores at low resolution or high SAP radius (> 10 Å) can find the large aggregation prone patches and does not provide a detailed view of these patches, the SAP values at high resolution or low radius (5 Å) can identify the sites to be mutated to mitigate protein aggregation. When the SAP values are measured at 5 Å, the residues with the scores above 0.15 are considered as hits (Courtois et al., 2016).

Molecular dynamics simulation of rhKGF.
Molecular dynamics (MD) simulation is a powerful tool for the characterization of the atomic-level behavior of the biomolecules like the protein motions and conformational changes in different physiological states (Courtois et al., 2016). The MD simulation was conducted with the 3-D structure of rhKGF obtained from homology modeling process using GROMACS software package (version 5.1.4). The rhKGF structure was solvated in a cubic box containing the transferable intermolecular potential with 3 points (TIP3P) water molecules with 10 Å distance between the protein and the edges of the solvation box. The solvated system was neutralized by adding Na þ and Clions. Then, the system was energy minimized in 1000 steps of steepest descent, and equilibrated in the canonical (NVT) and isothermal-isobaric (NPT) ensembles with the leapfrog algorithm and an integration time step of 0.002 picosecond (ps). The production run was performed for 100 ns.

Identification of the rhKGF dynamic SAP scores.
To measure the dynamic SAP scores of the rhKGF residues, the MD trajectory was converted into a pdb file containing 100 pdb(s) generated with the time interval of 1 ns (1000 ps) using GROMACS software. Then, the SAP values were computed at 5 Å and 10 Å for all 100 pdb files obtained from the MD trajectory based on the SAP calculation procedure developed by Dr. Naresh Chennamsetty using the CHARMM simulation program. Finally, the average SAP score of each residue was computed over the 100 ns MD production run. The SAP scores were also measured for the rhKGF 3 D structure obtained from homology modelling in the static state.

Integration of prediction methods
We integrated the result of APR prediction in rhKGF by different sequence-and structure-based tools on a graph to determine the consensus APRs. Furthermore, to distinguish the surface-exposed APRs from the buried ones mediating aggregation of rhKGF through native and non-native pathways, respectively, the consensus APRs were mapped onto the 3 Dstructure of rhKGF using PYMOL software version 2.4.1.

Retrieval of rhKGF sequence and structure
The sequence for rhKGF (Palifermin) in the DrugBank was 140 amino acids in length without the first methionine, consequently all of the analyses were performed with this sequence. In homology modeling using Galaxy web server, the refined model with the MolProbity score of 0.54 was chosen as the best one. MolProbity score is a log-weighted combination of clash score, percentage of Ramachandran outliers, and bad rotamers that gives a number reflecting the crystallographic resolution that those values are expected.

Sequence-based detection of APRs in rhKGF
Amyloid fibrils, as highly ordered fibrillar structures, are composed of the polypeptide chains arranged in a cross-b-sheet conformation. Many predictive algorithms have been developed to predict b-aggregation, and amyloid fibril-forming propensity of peptides and proteins (Garvey et al., 2013). Here, we tried to integrate most of the available predictive algorithms to predict the regions determining the cross-b aggregation and amyloid-forming propensity of rhKGF.
TANGO, as one such statistical algorithm, incorporates not only the peptide or protein sequence, but also the physicochemical parameters including pH, temperature, and ionic strength to calculate the cross-b aggregation propensity of proteins (Garvey et al., 2013). Analysis of rhKGF using TANGO identified three segments as potential APRs with the aggregation scores above the threshold (5%) including residues 48-56: TVAVGIVAI, 63-67: FYLAM, and 109-114: MFVALN which can participate in cross-b aggregation (Figure 2). The highest TANGO b-aggregation score belongs to the TVAVGIVAI peptide.
To distinguish the buried and exposed APRs predicted by TANGO, the rhKGF was analyzed using SolubiS server. The stretch-plot of rhKGF created by SolubiS representing the aggregation propensity and local stability of APRs showed that from the three APRs identified by TANGO, the regions 48-56, and 63-67 have positive summed DG and were predicted as critical (surface-exposed) APRs mediating nativestate aggregation and the region 109-114 with negative SolubiS score was defined as structural (buried) APR which participates in non-native aggregation (Figure 2).
According to the AGGRESCAN a4v numerical data (not shown) and hot spot area (Figure 3) prepared for the rhKGF, three aggregation hotspots have been recognized in the rhKGF including residues 43-58, 61-66, and 108-113. AGGRESCAN predicts these regions in the input polypeptide sequence by identification of the protein fragments which were experimentally proved to be involved in the aggregation of disease-linked proteins.
According to the 3 D profile approach implemented by ZipperDB and by threading the sequence of rhKGF on the crystal structure of NNQQNY peptide of the sup35 prion protein, the residues G29, T48-V54, G58, Y64, H93, T96, A98, E108, F110, and Q129 with the energy below À23 Kcal/mol were identified as amyloid-prone regions in rhKGF (Figure 4).
These regions can form the cross-beta spine of amyloid-like fibrils in rhKGF. Most of the APRs predicted by ZipperDB were common with those identified by TANGO and AGGRESCAN except for G29, H93, T96, A98, E108 and Q129.
The possibility of the fibril or protofibril structure formation and amyloid aggregation propensity of rhKGF was also predicted with Zyggregator. Zyggregator identified residues Y2, R18, T19, I43, V49-V54, I56, Y64, L65, D82, Y94-Y97, S99,   A100, T103-N105, A112, and T140 with the Z i agg score of one or higher as aggregation-prone residues ( Figure 5). Like TANGO, Zyggregator also considers the effect of physicochemical conditions including pH, temperature, ionic strength, and concentration of trifluorethanol on aggregation. Zyggregator is also able to predict the local instabilities of the structured proteins resulting in aggregation using the CamP program (Chiti et al., 2002;Meric et al., 2017). Zyggregator also evaluates the flanking residues ("gatekeepers") of a sliding window regarding the presence of charged residues which may decrease aggregation.
According to the Camsol structurally-corrected solubility profile, the residues V51, G52, and T140 have structurally-corrected solubility scores of À1.97, À1.05 and À1.43, respectively (Figure 7). These are the solvent-exposed poorly soluble residues of rhKGF which have been determined through structural correction to the CamSolintrinsic scores.
The result of amyloid aggregates prediction in rhKGF using WALTZ algorithm are presented in Table 1 and Figure 8. Most of the amyloidogenic regions predicted by WALTZ were common with the other APR prediction tools. The least conserved The green-colored line shows the threshold for aggregation tendency. Residues with scores above the threshold are predicted to participate in the formation of fibrillar aggregates. These residues are Y2, R18, T19, I43, V49-V54, I56, Y64, L65, D82, Y94-Y97, S99, A100, T103-N105, A112, and T140. Figure 6. The intrinsic solubility profile of rhKGF created by CamSol method based on the amino acid sequence. The regions with the scores more than þ1 are highly soluble (dark blue) and the regions with the solubility scores below -1 (shown in orange) are identified as poorly soluble ones. The amino acids with the score below -1 in rhKGF are R18, T19, Q20, Y22, A66, V49-V54, I56, F110-A112, F134-A138, and T140. Figure 7. The solvent-exposed poorly soluble amino acids in rhKGF with the CamSol structurally-corrected solubility scores below -1 are shown with the black arrows and marked with the yellow color. These residues are the candidate positions for mutations to enhance solubility. The structurally-corrected solubility scores are calculated from the intrinsic solubility scores based on the proximity of the amino acids in the 3 D structure and their solvent exposure. APR defined by WALTZ is "HYNTYA" which has only a few common residues with those detected by ZipperDB, Zyggregator and SALSA, but not with the other tools. The PASTA aggregation and disorder profile shows the aggregation probability versus disorder probability at residue k in self-aggregation mode. The pairing free energy of residue k with residue m is presented as free energy (k, m) in which the k and m letters represent the first and second amino acids of a pair, respectively (Raja & Kinne, 2020). The PASTA aggregation profile provides a normalized per-residue aggregation probability as h (k). This parameter is calculated from the h (k) equation (Equation (5)) (Trovato et al., 2006) and shows that which regions of the protein have more aggregation tendency. The h 2 (k, m), calculated from Equation (6) is the normalized two-dimensional pairing probability of sequence which is determined from the sequence self-alignment (Trovato et al., 2006). According to the PASTA aggregation and disorder profile, aggregation free energy profile and aggregation-pairing matrix (Figure 9) prepared for rhKGF, the portion 40-60, owing to the lowest aggregation free energy, is the most amyloid-stabilizing region responsible for rhKGF amyloid-forming propensity. Furthermore, this area has the highest pairing probability which is responsible for protein self-association.
According to the SODA solubility profile, the low solubility regions of rhKGF are located at positions 40-60, 87-89, and 108-112 ( Figure 10). SODA predicts protein solubility based on disorder and aggregation which both of them are affected by the physicochemical properties of the amino acids including hydrophobicity, secondary structure preferences, and charge.

Structure-based detection of APRs in rhKGF
Structure-based prediction of APRs using Aggrescan3D identified residues Y2, M5, I10, V51, V59, Y74, L88, I89, L90, and I139 as spatially-adjacent APRs involved in the aggregation of rhKGF (Figure 13). A3D score is the structurally-corrected aggregation value deduced from AGGRESCAN score in which the surface exposure of amino acids and the residue distance from center were also considered.
The SAP algorithm have been designed to identify the dynamically exposed hydrophobic patches on the surface of the protein which their interactions with the other hydrophobic patches on the neighboring molecules in the native state of the protein lead to the accumulation and aggregation of the protein molecules (Buck, 2012;Lee et al., 2013). Figure  14 demonstrates the result of Spatial Aggregation Propensity (SAP) calculation for rhKGF. The residue SAP scores and the SAP mapped PDB files showed that almost half of the rhKGF residues have positive SAP scores forming the hydrophobic exposed patches on the surface of rhKGF which make it prone to aggregate in the aqueous media. The SAP mapped PDB files were visualized and presented by the Visual Molecular Dynamics (VMD) software (version 1.9.3). The color intensity indicates the degree of hydrophobicity or hydrophilicity. The graph of SAP values was prepared by the GraphPad Prism software (version 8.3.0). The Dynamic SAP calculation has identified nine residues with the SAP scores above 0.15 at R ¼ 5 Å on the surface of the rhKGF which are considered as potential APRs. The static SAP calculation defined the same residues as those identified by dynamic calculation as high SAP residues, except for the residue Y2 which had a static SAP score of 0.15, but an average dynamic SAP score of 0.14.

Discussion
As previously mentioned, high aggregation tendency of rhKGF, manifested by loss of the monomeric state and accumulation of the aggregated species at moderate temperatures, has made it a fairly unstable protein with limited pharmaceutical applications. Optimization of the formulation/storage condition e.g. using different stabilizers (heparin), osmolytes (like trehalose and sucrose) and salts (like ammonium sulfate, and sodium chloride) applied for KGF (Chen et al., 1994) and other FGFs (Benington et al., 2020) may be implemented for in vitro applications, but in vivo applications are limited to the naturally occurring solution environment. For the latter, aggregation management can be obtained by designing aggregation-resistant variants. For Figure 8. The amyloidogenic regions identified in Palifermin using WALTZ. The best overall performance threshold has identified four regions as APR in rhKGF including 18-22, 39-47, 52-57, and 93-98. example, stable mutants of FGF2 (bFGF) were designed by sequence alignment with FGF1 stabilized mutants. Furthermore, the structural free energy analysis to enhance conformational stability of FGF2 have resulted in nine point mutants with increased in vitro functional half-life from 10 h to more than 20 days at 37 C (Benington et al., 2020). Different groups of mutations including deletion, cysteine substitution, charge substitution, and combined mutations were also assessed on native KGF and the D23 analog with $2 fold increase in half-life to 1.2 days with optimal activity was approved by FDA for clinical applications (Hsu et al., 2006). Increasing the protein conformational stability protects the protein from unfolding and the APRs buried in the hydrophobic core of the protein from triggering aggregation and therefore reduces the protein aggregation tendency (van der Kant et al., 2017). Characterization of the sequence/ structural APRs in proteins may help mitigate aggregation in biotherapeutics via engineering the surface-exposed APRs (either in the native or unfolded state) and selection of drug candidates with reduced aggregation tendency (Meric et al., 2017).
A number of computational methods have been developed to identify the sequence aggregation-prone regions in the amyloidogenic proteins. Some of these algorithms like TANGO, AGGRESCAN, AGGRESCAN3D, CamSol, and PAGE focus on prediction of b-aggregation prone regions, but not necessarily the amyloid ones. Some of the other algorithms like PASTA, SALSA, Waltz, Zyggregator, 3 D profile, Pafig, Fold-amyloid, NetCSSP, AmyloidMutants, Amyloidogenic Pattern, and SecStr have focused on both prediction of b-aggregation and amyloid-fibril prone regions. Here, we tried to analyze the rhKGF with most of the publicly available tools applying different algorithms to prevent biases from the training sets, parametrization, and the specific characteristics of any given method. This approach helped us to make a prediction of APRs in rhKGF with higher confidence. We also measured the dynamic exposure of the hydrophobic patches in the rhKGF using SAP tool to identify structural APRs mediating native-state aggregation.
Previous studies recognized several protein conformations as the seed of protein aggregates at the molecular level. The most studied conformational seed is the cross b-motif observed in the amyloid-like fibrils. The cross b-motifs composed of intermolecular b-sheets form the spine of amyloids aggregation as highly ordered aggregates (Nelson et al., 2005). Siepen et al proved that the exposed side b-strands may also give rise to amyloids by forming the cross b-motifs as seen in transthyretin and b 2 -microglobulin (Siepen et al., Figure 9. Prediction of amyloid structure aggregation in rhKGF primary structure using PASTA. According to aggregation and disorder profile (A), aggregation free energy profile (B), and aggregation-pairing matrix (C) the portion 40-60, having the lowest aggregation free energy, is the most amyloid stabilizing region responsible for rhKGF amyloid-forming propensity. The graph (C) shows the paring probability important for protein self-aggregation. The darkest area belongs to the region 40-60 and has the most paring probability. Figure 10. The graphical solubility profile of rhKGF prepared by SODA server. The low solubility regions (the aggregation hits) are shown in blue. Figure 11. The consensus results of APR prediction with AMYLPRED server. The prediction is based on the overlap of two out of five methods. The hits (15-25, 43-56, 63-67, 88-91, 109-114, and 133-139) were presented with the Ã symbol.  Figure 13. A) A3D structure-based analysis of the rhKGF aggregation propensity in the dynamic state. The protein surface has been colored according to the A3D score in gradient (from Red for high aggregation propensity to white for negligible effect on the aggregation propensity to blue for high solubility regions). B) The A3D profile of the rhKGF. The positive and negative scores are correlated to aggregation-and solubility-prone residues, respectively. The residues with the highest A3D score involved in the formation of the aggregation-prone patches of the rhKGF folded state have been marked. Figure 14. Spatial aggregation propensity for the rhKGF (Palifermin). (A) The SAP values at R ¼ 5 Å are mapped onto the rhKGF structure. (Note: Due to the difference in residue SAP values and therefore color intensity in different SAP_mapped PDBs from dynamic SAP calculation, the SAP_mapped PDB obtained from the static SAP calculation was used to show the SAP values on the structure and it is the reason that the color intensity and SAP scores are not matched exactly.). The regions with the positive SAP scores (more hydrophobicity and aggregation propensity) are shown in red and the areas with the negative SAP values (less hydrophobicity and aggregation propensity) are represented in blue. The color intensity shows the degree of hydrophobicity or hydrophilicity. The residues with the SAP values above 0.15 at R ¼ 5 Å which are considered as the candidate positions for mutation are located. (B) Table of residues with the average dynamic SAP scores higher than 0.15 at R ¼ 5 Å, and the corresponding SAP scores. (C) The plot of average dynamic SAP values of the rhKGF residues at R ¼ 5 Å. The residues with SAP scores greater than 0.15 are marked. Figure 15. (A) The aggregation propensity of rhKGF was analyzed using different sequence, structure-based and consensus APR predictors. The APRs in rhKGF have been indicated with different colors for different predictors. The predictions showed that the APRs in rhKGF are mainly located at positions 10-30, 40-60, 61-66, 88-120, and 130-140. Similar to APRs found in the other amyloidogenic proteins, rhKGF APRs are rich in b-branched aliphatic, hydrophobic, aromatic and in some regions Q/N residues. While the APRs identified by the sequence-based predictors might be buried or exposed in the folded protein, the structure-based predictions made with the Aggrescan3D and SAP limited the number of identified APRs to the dynamically-exposed hydrophobic residues including V12, A50, V51, L88, I89, L90, I118, L135, and I139 mediating the native-state aggregation. (B) The modeled crystal structure (space-filling model) of the rhKGF. Mapping the computationally predicted APRs found by the sequence-based tools on the tertiary structure of the rhKGF showed that most of the APRs are solvent-exposed in the natively folded protein including F16-R25, I43, E45, R47-I56, F61, Y62, N66, L88-E91, E108-F110, A112, N114, T131, and H133-T140.
2003). Moreover, Dobson's group showed that fibril formation is the general properties of proteins under stress conditions (Chiti et al., 2002). The rhKGF is also a b-rich protein motivating us to explore if the cross b-motif conformation promotes its aggregation. To examine the b-sheet aggregation propensity of rhKGF, it was analyzed using different algorithms of which TANGO, AGGRESCAN, AGGRESCAN3D, and CamSol have been dedicated to identification of cross b-motifs in proteins. TANGO calculations revealed that rhKGF contains three motifs with the aggregation scores above the threshold. The highest aggregation score belonged to the region TVAVGIVAI (48-56) implying that this peptide has the highest aggregation propensity in rhKGF. The existence of one such aggregation-prone motifs can be sufficient for cross-b motif-derived aggregation, but it does not necessarily guarantee that the protein aggregation proceeds by this mechanism (Wang et al., 2009).
Using TANGO APR prediction algorithm, SolubiS is also able to define the motifs involved in cross-b aggregation of proteins. SolubiS combines the TANGO APR prediction results with FoldX free energy estimation to make distinction between the structural and critical APRs. Among the APRs detected by TANGO, the regions 48-56, and 63-67 positive summed DG were characterized as critical APRs which are the able to mediate aggregation under native conditions. The least stable APR with the highest TANGO score is the region 48-56 which is also involved in the receptor binding and it can be problematic when the protein is not engaged in the protein-receptor interaction. Previous studies showed that mutation of the critical APRs can reduce the overall aggregation tendency and increased amount of soluble protein produced in mammalian cells (van der Kant et al., 2017). The region 109-114 has the lowest SolubiS score due to its high contribution to local thermodynamic stability and therefore is considered as a structural APR which is naturally protected from triggering aggregation by folding. Increasing the conformational stability can protect such APRs from exposure and contact with the similar sequences on the neighboring molecules and prevents non-native aggregation (Meric et al., 2017;van der Kant et al., 2017).
The APRs predicted by AGGRESCAN were highly similar to those predicted by TANGO. The similarity was expected as both tools use the same approach in the characterization of APRs i.e. predicting b-aggregation prone regions, but not necessarily the amyloid ones (Conchillo-Sol e et al., 2007). AGGRESCAN3D as the optimized version of AGGRESCAN which is capable of distinguishing the buried from surfaceexposed APRs, is also a predictor of b-sheet aggregation. AGGRESCAN3D identified the spatially-adjacent APRs involved in the aggregation of rhKGF from the native state by structural correction applied to the AGGRESCAN score. AGGRESCAN3D has exhibited higher accuracy compared to the sequence-based algorithms in forecasting the aggregation hot spots of the globular proteins (Zambrano et al., 2015).
CamSol is also able to provide predictions of protein b-aggregation propensity. However, unlike the other algorithms which may only focus on amyloid formation, CamSol also provides predictions of protein solubility (Sormanni et al., 2015). The intrinsic solubility profile of the rhKGF prepared by CamSol identified the poorly soluble and aggregation-prone residues in rhKGF. The results of APR prediction in rhKGF using camsol are largely consistent with TANGO and AGGRESCAN and the most conserved regions are VAVGIV (V49-V54) and FVA (F110-A112). These residues are the most important candidate positions for designing variants of rhKGF with enhanced solubility. Like AGGRESCAN3D, Camsol structurally-corrected solubility algorithm applies the structural correction to the CamSolintrinsic scores to identify the solvent-exposed poorly soluble amino acids. CamSol employs the variables similar to those implemented in Zyggregator with the difference in the parameters and definition of the gatekeeping effect (Sormanni et al., 2015).
ZipperDB, Zyggregator, WALTZ, PASTA, SODA, AMYLPRED, and AMYPDB are servers which have been developed to predict both cross-b aggregation and amyloid-forming propensity of peptides and proteins (Siepen et al., 2003). Among these tools, ZipperDB is unique regarding its prediction approach since it makes use of the structural information to find the probability of fibril formation for a particular sequence (Sawaya et al., 2007). Zyggregator predicts the possibility of the fibril or protofibril structure formation and amyloid aggregation propensity based on the combination of physico-chemical properties of the amino acid content of the protein including hydrophobicity, a-helical and b-sheet propensity, hydrophilic/hydrophobic patterns and net charge of the polypeptide (Meric et al., 2017). WALTZ has the highest accuracy, since its data have been validated by the experimental methods. No other existing amyloid prediction tools has such a high validation rate (Maurer-Stroh et al., 2010). The unique feature of PASTA and SODA is considering the intrinsic disorder of the protein as a major determinant of protein aggregation. Because the highly dynamical disordered regions of protein can increase its aggreagation propensity (Chiti, 2003).
AMYLPRED and AMYPDB are also referred as metapredictors due to their analogy to the metasearch engines. The rational in favor of consensus strategies is that different strategies consider different contributors in the estimation of protein aggregation propensity. Since it is still unknown that which contributors are most important, and various algorithms weigh these contributors differently, using metapredictors in which their members complement each other may enhance prediction accuracy. It is also more likely that applying consensus methods with the algorithms trained with different data sets may result in less false positive results (Meric et al., 2017;Tsolis, 2013). The APRs identified by AMYLPRED and AMYPDB may be considered as representative of the result of other sequence-based APR prediction tools. Despite the differences in the algorithm and parameters used in different tools, it can be concluded that the regions 10-30, 40-60, 61-66, 88-120, and 130-140 play the most important roles in amyloid-forming propensity of rhKGF. These regions can be considered as the candidate positions to create the mutants with reduced aggregation propensity.
For a given aggregation-prone region to promote aggregation, it should have a high intrinsic aggregation propensity, be surface-exposed or become exposed after conformational transition like complete/partial unfolding or misfolding events to facilitate intermolecular interactions. Therefore, the presence of sequence/structural APRs may be essential but is not sufficient for protein aggregation (Chennamsetty et al., 2010). The sequence-based APR prediction tools discussed above, are only capable of distinguishing sequentially contiguous residues as APR in small proteins. These tools generally assume that an unfolding event is required to expose and make the generally hydrophobic hot spots with a high b-sheet forming propensity available for contacting the analogous sequences on the neighboring molecules. The APRs which are exposed after an unfolding event mediate protein aggregation through non-native pathway (Meric et al., 2017). Hence, one proposed solution to reduce protein aggregation is to enhance the conformational stability of the protein by increasing the free energy of unfolding (DGunf) (Meric et al., 2017). Using structure-based APR detection tools like SolubiS, Structurally corrected version of CamSol, Aggrescan3D, and SAP enabled us to make distinction between APRs mediating native (aggregation proceeding from the native monomeric state of the protein) and non-native aggregation (aggregation after partial/complete unfolding event). Aggrescan3D (A3D) and SAP are capable of detecting noncontiguous surface-exposed APRs in the large natively-folded globular proteins like monoclonal antibodies, and therefore recognizing APRs involved in the native-state aggregation (Meric et al., 2017;Nelson et al., 2005).
The Dynamic SAP calculation has identified nine residues with the SAP scores above 0.15 at R ¼ 5 Å on the surface of the rhKGF. The high sap residues are considered as potential APRs and are the candidate positions for mutations against aggregation. There was no significant difference between the result of SAP calculation in the static, and dynamic states. According to the SAP algorithm, these are the hydrophobic residues which may expose under the normal dynamic fluctuations (Chennamsetty et al., 2010). These fluctuations may combine them with the other surface-exposed hydrophobic residues and lead to the formation of larger surface-exposed hydrophobic patches. The interaction of the hydrophobic patches on the neighboring molecules in the native state of the protein leads to the accumulation and aggregation of the protein molecules (Buck, 2012;Lee et al., 2013). SAP considers both dynamic exposure and spatial proximity of each residue in the protein tertiary structure and is also applicable for large biotherapeutics like antibodies. In the SAP algorithm like Aggrescan3D, the unfolding event is not considered as a prerequisite of aggregation, and APRs are identified in the native state of the protein (Chennamsetty et al., 2010). As the hydrophobic interactions have been shown to be the predominant mechanism of aggregation, identification of the dynamically exposed hydrophobic patches of the protein with the SAP tool, and mutation of the high SAP residues; indicators of aggregation hot spots, to the residues with the less SAP scores can reduce the protein aggregation tendency (Wang et al., 2009). The APRs detected by the sequence-based APR detection tools might be buried inside the protein and only become exposed after an unfolding event or be surface exposed in the native state. Mapping these APRs on the native structure of the rhKGF revealed that most of them are solvent-exposed in the natively folded protein including F16-R25, I43, E45, R47-I56, F61, Y62, N66, L88-E91, E108-F110, A112, N114, T131, and H133-T140. Therefore, these regions may promote aggregation of the rhKGF without the occurrence of a significant unfolding event preceding the protein aggregation, or the conformational change may occur within the oligomers composed of the folded monomers (aggregation-competent conformations) (Meric et al., 2017). The other regions are buried in the native state and contribute to non-native aggregation by a preceding unfolding event in the monomeric state of the protein. Increasing the protein conformational stability can prevent these regions from triggering aggregation. The structurebased predictions limited the number of identified APRs to the dynamically-exposed hydrophobic residues including V12, A50, V51, L88, I89, L90, I118, L135, and I139 mediating the native-state aggregation. The FGF7 mutational analysis has indicated that some of these regions including M37-K57, which has been predicted as APR by most of the sequencebased APR detection tools, is important for the interaction of all FGFs with FGF receptors (FGFR) with the RTVAV motif (residues 47-51) being specific for FGF7-FGFR2IIIB interaction. The residues N95, Y97, L135, and M137 are also important for all FGF-FGFR interactions, and the residues D9, R11, and L88 are specific for KGF-KGFR interaction [44]. Furthermore, some of the APRs identified in rhKGF are also located in the heparinbinding sites including R11, R13, R14, Q20, K27, K101, and R121. Mutation of the residues involved in receptor/heparinbinding may disrupt the KGF biological activity (Chiti, 2003;Maurer-Stroh et al., 2010). Except for these regions, the other residues can be substituted with either hydrophilic or even less hydrophobic residues to design biologically active analogs of rhKGF with reduced aggregation propensity.

Conclusion
The aggregation propensity of rhKGF was analyzed using both sequence and structure-based APR predictors. Accordingly, the high intrinsic aggregation propensity of rhKGF is mainly mediated by the amino acids located at positions 10-30, 40-60, 61-66, 88-120, and 130-140. The aggregation-prone motifs found in the rhKGF were rich in the hydrophobic residues including Gly, Ala, Val, Leu, and Ile, and aromatic residues like Phe, Tyr and Trp which have high b-sheet forming propensity. Some motifs also contained Asn (N) and Gln (Q) which have been demonstrated to be involved in the formation of amyloid aggregates. Charged residues have been rarely observed in these motifs. Using sequence-and structure-based APR prediction tools enabled us to distinguish buried and exposed APRs mediating nonnative and native aggregation of rhKGF, respectively. As most of the predicted APRs in rhKGF, even those identified by the sequence-based tools are solvent-exposed in the native state of the protein, it can be concluded that the non-covalent aggregation of rhKGF mainly occurs through the native pathway. Limiting the rate of aggregation can be achieved via increasing conformational stability or mutating the APRs to the less aggregation-prone residues. Some of these regions contribute toward receptor/heparin binding. This is a significant challenge to design variants of biotherapeutics with low vulnerability toward aggregation and unchanged biological activity. However, except for the functional residues, the others can be substituted with either hydrophilic or even less hydrophobic residues to design active analogs of rhKGF with reduced aggregation propensity. Furthermore, the structure-based rational design (e.g. mutation of high SAP-valued residues which are not involved in receptor/heparin binding) as a part of QbD approach may also help reducing the total aggregation propensity without compromising the biological activity.