PepQSAR: a comprehensive data source and information platform for peptide quantitative structure–activity relationships

Peptide quantitative structure–activity relationships (pQSARs) have been widely applied to the statistical modeling and empirical prediction of peptide activities, properties and features. In this procedure, the peptide structure is characterized at the sequence level using amino acid descriptors (AADs) and then correlated with observations by machine learning methods (MLMs), resulting in a variety of quantitative regression models that are used to explain the structural factors governing peptide activities, to generalize peptide properties from known to unknown samples, and to design new peptides with desired features. In this study, we developed a comprehensive platform, termed the PepQSAR database, which is a systematic collection and decomposition of various data sources and abundant information regarding pQSARs, including AADs, MLMs, data sets, peptide sequences, measured activities, model statistics, and literature. The database also provides a comparison function for the various previously built pQSAR models reported by different groups via distinct approaches. The structured and searchable PepQSAR database is expected to provide a useful resource and powerful tool for the computational peptidology community, and is freely available at http://i.uestc.edu.cn/PQsarDB.


Introduction
Bioactive peptides possess various metabolic and physiological regulatory functions in the human body and have been implicated in the treatment of various diseases and disorders (Abd-Talib et al. 2022). They are easily digested and absorbed, and offer considerable benefits for the design and development of new peptidic drugs. In addition, various protein-protein interactions (PPIs) in the human interactome are directly or indirectly mediated by short peptide segments. These peptide-mediated PPIs (PMIs) are normally weak and transient, are easily formed and disrupted, and are thus ideal for controlling fast biological events (Petsalaki and Russell 2008). Over the past decades, peptide quantitative structure-activity relationships (pQSARs) have been widely used to model and predict the diverse in vitro and in vivo behaviors of peptides from a statistical point of view, including inhibitory activity, binding affinity, physicochemical property and biological function. The pioneering work on pQSARs was done by Sneath et al. in the 1960s (Sneath 1966), who characterized peptide sequences using semi-quantitative experimental parameters of the 20 natural amino acids; this was the first attempt to apply amino acid descriptors (AADs) to parameterize the structural features of peptides at the sequence level. Since then, pQSARs have spread widely in biology and medicine, have become an important branch of computational peptidology (Zhou et al. 2013a), and have been successfully employed to, for example, develop protein-peptide affinity scoring predictors (Zhou et al. 2022; Li et al. 2019), guide rational peptide library evolution (Zheng et al. 1998; Jing et al. 2013), and analyze functional peptide properties (Zhou et al. 2013b).
Peptides can be characterized with respect to their primary sequence or their three-dimensional structure. The former is usually done with AADs (local descriptors), which are a few informative latent components extracted from a large set of amino acid properties, while the latter regards a peptide as a stereochemical molecule and employs conventional molecular descriptors (global descriptors) to parameterize the peptide (Zhou et al. 2008a). AADs are the standard pQSAR characterization strategy; they are normally derived from a variety of original amino acid properties, such as topological, physicochemical, structural and quantum-chemical parameters, using multivariate statistical techniques such as principal component analysis (PCA) and factor analysis (FA) (Westen et al. 2013). The characterization is then correlated with observed activities over a panel of peptide samples using machine learning methods to generate statistical regression models, which are validated by internal cross-validation and external blind tests. Previously, we successfully developed several AADs, including T-scale (Tian et al. 2007), DPPS (Tian et al. 2008, 2009) and VSW (Tong et al. 2008), which have been applied to the statistical modeling, activity prediction, property analysis and rational design of a variety of bioactive peptides such as SH3-binding peptides (Zhou et al. 2008b), antioxidant peptides (Tian et al. 2015), antihypertensive peptides (Zhou et al. 2013b) and food peptides (Nongonierma and FitzGerald 2016). Recently, we also performed a systematic comparison and comprehensive evaluation of 80 AADs in pQSAR modeling. We concluded that, since the many AADs available to date already cover numerous amino acid properties, developing further AADs is not an essential route to improving peptide QSAR modeling; the traditional AAD methodology is believed to have almost reached its upper limit (Zhou et al. 2021).
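The sequence-level characterization described above can be illustrated with a minimal Python sketch. It assumes a small property table whose numerical values are invented purely for illustration (they are not real amino acid data), derives latent components from it by PCA, and encodes a peptide by concatenating the per-residue descriptor vectors. The names `derive_descriptors` and `encode_peptide` are hypothetical and do not correspond to any published AAD.

```python
import numpy as np

# Hypothetical, made-up property table (NOT real amino acid data):
# rows are residues, columns are three arbitrary properties.
RESIDUES = ["A", "G", "L", "K", "D"]
PROPERTIES = np.array([
    [0.3, 1.2, -0.5],
    [-0.8, 0.1, 0.9],
    [1.5, -0.4, 0.2],
    [-0.2, -1.1, -1.0],
    [-0.9, 0.6, 0.7],
])


def derive_descriptors(props, n_components):
    """PCA via SVD: autoscale the columns and return per-residue scores."""
    scaled = (props - props.mean(axis=0)) / props.std(axis=0, ddof=1)
    u, s, _ = np.linalg.svd(scaled, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def encode_peptide(sequence, table):
    """Concatenate the descriptor vectors of the residues, position by position."""
    return np.concatenate([table[res] for res in sequence])


scores = derive_descriptors(PROPERTIES, n_components=2)
aad_table = dict(zip(RESIDUES, scores))

# A tripeptide described by 2 components per residue gives 6 variables.
x = encode_peptide("GLK", aad_table)
print(x.shape)  # (6,)
```

With this encoding, peptides of equal length map to vectors of equal length, which is what allows them to be stacked into a descriptor matrix for regression.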
The number of peptide databases and online servers has increased rapidly over the past decades. Examples include EROP (Zamyatnin 2006), PepBank (Shtatland et al. 2007) and SwePep (Fälth et al. 2006), which focus on endogenous peptides, and BIOPEP, which deposits peptides of food origin (Iwaniak et al. 2005). The first three draw mainly on resources from the life sciences; the fourth draws predominantly on literature from food science. SwePep contains the most advanced tools for identification of peptides. BIOPEP and PepBank allow the biological activity profile of peptide fragments to be constructed. PepBank can be screened by BLAST or SSEARCH using a protein sequence as a query (Shtatland et al. 2007). In addition, antimicrobial peptides (AMPs) are host defense molecules ubiquitous in the innate immune systems of invertebrates and vertebrates. The antimicrobial peptide database (APD) has long focused on AMPs with defined sequence and activity, and so far includes a total of 2619 AMPs. The ASPD database is dedicated to biopanning data and incorporates data from 195 screening experiments (Valuev et al. 2002). The biopanning data bank (BDB) includes biopanning data isolated from random peptide libraries constructed by diverse display technologies (He et al. 2016). Databases are also significant for drug discovery. For example, PK/DB (Moda et al. 2008) contains pharmacokinetic information for 1389 small compounds covering structurally diverse drug-like and lead-like molecules, and e-Drug3D (Pihan et al. 2012) is a 3D chemical structure database of drugs that provides a ready-to-use collection of screenable SD files for several drugs and commercial drug fragments. However, pQSAR data and models have not yet been addressed by the bioinformatics and related communities.
In the current study, we established a new database of Peptide Quantitative Structure-Activity Relationship (PepQSAR) as a comprehensive data source and information platform for pQSAR studies. The database consists of three blocks, namely AADs, peptide samples and published pQSAR models, and represents a basic and useful resource for the QSAR community.

Data source
PepQSAR contains data for 261 published pQSAR models built with 80 AADs and 2583 bioactive peptide sample panels, supported by 337 pieces of experimental evidence drawn from more than 1,000 publications (summarized in Tables 1 and 2, Fig. 1). Data in PepQSAR are freely accessible without limitations and can be downloaded as tab-delimited flat files. As for modeling methods, multiple linear regression, principal component regression and PLS can be employed for linear modeling, while quadratic PLS and neural networks are suited to nonlinear modeling. The leave-one-out cross-validation correlation coefficient (q²) and root mean square error (RMSCV) are usually used as criteria for evaluating the stability and predictive power of a model. Among the three statistical parameters, namely the fitted correlation coefficient r², the cross-validated q² and the external prediction q²ext, each is a necessary condition for the next: the QSAM model must have high self-stability (q²) to have excellent predictive power (q²ext), and high self-stability in turn requires strong fitting ability (r²). An excellent model should have both good fitting ability for internal samples and powerful generalization ability for any external sample (Golbraikh and Tropsha 2002).
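As a hedged illustration of how the fitting and leave-one-out statistics are commonly computed, the following Python sketch fits a multiple linear regression to synthetic data and reports r², q² and RMSCV. The data, the helper names (`fit_mlr`, `predict_mlr`, `r_squared`, `loo_statistics`) and the particular q² convention (total sum of squares taken around the mean of all samples) are assumptions made for this example; published models may use slightly different definitions.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares multiple linear regression with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Apply the fitted coefficients (intercept first) to new samples."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def loo_statistics(X, y):
    """Leave-one-out cross-validation: returns (q2, RMSCV)."""
    n = len(y)
    y_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # drop sample i from the training set
        coef = fit_mlr(X[keep], y[keep])
        y_cv[i] = predict_mlr(coef, X[i:i + 1])[0]
    press = np.sum((y - y_cv) ** 2)       # predictive residual sum of squares
    q2 = 1.0 - press / np.sum((y - y.mean()) ** 2)
    rmscv = np.sqrt(press / n)
    return q2, rmscv


# Synthetic data standing in for a descriptor matrix and measured activities.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=30)

r2 = r_squared(y, predict_mlr(fit_mlr(X, y), X))
q2, rmscv = loo_statistics(X, y)
print(f"r2={r2:.3f}  q2={q2:.3f}  RMSCV={rmscv:.3f}")
```

For ordinary least squares under this convention, q² cannot exceed r², which is consistent with the ordering of the parameters discussed above.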

Utility and discussion
Developing web software based on a browser/server (B/S) structure requires a combination of technologies. The client side is generally written using HTML, CSS, JavaScript and related technologies, and relies on a browser to render the graphical interface through which the user browses or operates the site. The server side uses a web server to receive requests from the client and return the results. It uses a database management system to store and manage the data required for business processing, and a server-side scripting language to access the database, dynamically generate the page content and respond in a timely manner. Although developers can choose from many web development technology stacks, the open-source LAMP stack has received attention from the entire IT community for its high compatibility, low investment cost and stable operation. Incorporating some of the best features of modern programming languages, the combination of PHP, Apache and MySQL has become a standard for web servers.
PepQSAR was developed using a MySQL database, HTML (for the graphical user interface), CSS (for style sheets), PHP (as the server-side scripting language) and JavaScript (for displaying alert messages and validating data) in a manner that is transparent to the end user.

Functional requirements analysis
Functional requirements are the most basic requirements in web development and reflect the basic expectations of users (Fig. 2). Functional requirement analysis is critical, and the requirements change over time with the continuous development of the Internet. The website is developed using the B/S architecture, so that users can access it from their own computers as long as they are connected to the Internet. Based on the overall demand analysis, the system comprises two main modules: (1) a web front-end module and (2) a back-end function module, in which the administrator organizes the database and keeps it up to date, mainly by adding and deleting data in a timely manner.
On the top left of the home page is the logo of the site; below the logo is the navigation bar.

Peptide panels and the activity measured for each:
OA calc (calculate oxytocic activity from the loadings for that amino acid that is in position 3)
Thromboplastin inhibitor sequences: IC50 (activity with respect to two responses: activated partial thromboplastin time and thromboplastin time)
GPCR-derived peptides: Ki (an affinity narrowed down to only include annotations with protein confidence score)
Cationic AMP peptides: MIC (minimal inhibitory concentrations for the series of 101 CAMEL-s against the listed microorganisms have been previously averaged to produce the mean antibiotic potency parameters)
Antimicrobial peptides: N (the number of Staphylococcus aureus killed within 2 h)
Elastase substrates: log(Kcat/Km) (Km was calculated from the Lineweaver-Burk plot obtained with substrate concentrations in the range 0.01-2.5 mM, Kcat = v/[E])
Bovine lactoferricin-(17-31)-Pentadecapeptide: MIC (minimal growth inhibitory concentration against S. aureus)
Tachykinins: RVD, RPA and HT (the isolated rat vas deferens, the rabbit pulmonary artery, and the hamster isolated trachea)
Bradykinin potentiating pentapeptides: logRAI (the logarithm of a relative potentiating activity index)
Neurotensin analogues: pKd (binding potency at the human neurotensin receptor and the rat neurotensin receptor were evaluated as equilibrium dissociation constants from radioligand binding assays)
Pseudopeptides: LD50 (lethal median doses, oncostatic activity on L1210 leukemia and acute toxicity)
Pepstatin analogues: Ki (the inhibition constants of porcine pepsin)
SH3 binding peptide: BLU (the Boehringer light unit derived from previous arbitrary light intensity assays following SPOT peptide synthesis)
HLA-A*0201-restricted CTL epitopes: pBL50 (the half-maximal binding level, which is the peptide concentration yielding the half-maximal FI of the reference peptide in each assay)
Peptide analogues of PG-1: log(10^3/MIC) (the logarithm of minimum inhibitory concentration)
Non-nucleotide reverse transcriptase inhibitors: EC50 (activity against 14 HIV mutants and HIV replication was quantified by measuring the EGFP fluorescence)
Protease inhibitors: relative pIC50 (difference between mutant pIC50 and wild-type pIC50)
Inorganic-binding peptides: Binding (%, binding affinity to mica surface)
Periplasmic chaperone (FIMC dataset): EC50 (ability to prevent FimC/FimH complexation was expressed as inhibition at 50 µM concentration)

Database block
The PepQSAR database is organized into three modules, as shown in Fig. 3.
(1) Descriptor block: The search method is chosen above the search box. To retrieve entries of interest, PepQSAR provides keyword search: users enter a name in the search box and are taken to the search result page. If the user selects type search, a drop-down selection box is provided. PepQSAR returns the search results as a sorted list.
(2) Peptide block: Entering the name of an active peptide in the search box leads to the search result page. The search result displays the active peptide information as a list that includes Name and Introduction. Clicking on Name opens the detailed information for the peptide, including name, brief introduction, method, modeling results and references. The field design of the peptide data sheet is detailed in Table 4. QSAR results and peptide activity values can be viewed and downloaded.
(3) QSAR model block: The descriptors and peptides corresponding to each model are shown on this page. Relevant information can be obtained by simply entering the names of active peptides or descriptors in the search box. The search result displays the information as a list that includes Descriptor, Peptide, QSAR Result and References. The field design of the model data sheet is detailed in Table 5.

Fig. 2 Development workflow, including data collection, requirements analysis, detailed design, integrated environment development, software testing, etc.

Fig. 3 The database architecture mainly contains three modules, namely descriptors, bioactive peptides and QSAR models, providing query and download operations
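The three-module organization can be sketched as a small relational schema. The example below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for MySQL, the table and column names are hypothetical (the actual field designs are those of Tables 4 and 5), and the inserted rows are placeholders rather than real PepQSAR entries.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL back end.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: one table per block; a model links a descriptor
# and a peptide panel through foreign keys.
conn.executescript("""
CREATE TABLE descriptor (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE peptide (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    introduction TEXT
);
CREATE TABLE qsar_model (
    id            INTEGER PRIMARY KEY,
    descriptor_id INTEGER NOT NULL REFERENCES descriptor(id),
    peptide_id    INTEGER NOT NULL REFERENCES peptide(id),
    qsar_result   TEXT,
    reference     TEXT
);
""")

# Placeholder rows (not real database content).
conn.execute("INSERT INTO descriptor (id, name) VALUES (1, 'descriptor_A')")
conn.execute(
    "INSERT INTO peptide (id, name, introduction) "
    "VALUES (1, 'panel_X', 'placeholder entry')"
)
conn.execute(
    "INSERT INTO qsar_model (descriptor_id, peptide_id, qsar_result, reference) "
    "VALUES (1, 1, 'placeholder statistics', 'placeholder citation')"
)


def search_models(conn, keyword):
    """Keyword search over descriptor and peptide names, returned as a sorted list."""
    pattern = f"%{keyword}%"
    return conn.execute(
        """
        SELECT d.name, p.name, m.qsar_result, m.reference
        FROM qsar_model AS m
        JOIN descriptor AS d ON d.id = m.descriptor_id
        JOIN peptide    AS p ON p.id = m.peptide_id
        WHERE d.name LIKE ? OR p.name LIKE ?
        ORDER BY d.name, p.name
        """,
        (pattern, pattern),
    ).fetchall()


rows = search_models(conn, "panel")
print(rows)
```

Passing the keyword as a bound parameter, rather than formatting it into the SQL string, keeps the query safe against injection when the search box is exposed to end users.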

Software testing
Software testing is an indispensable step after the design and development of every program, because it verifies that the functions of the website behave as intended. It is a systematic effort in which developers test the soundness and safety of the software against the requirements. After testing, the developers revise the software and modify, within a defined scope, the modules whose behavior differs from the intended logic and functions. In black-box testing, the developers enter commands according to the requirement specification without knowing how the program works internally, and then check whether the program receives the parameters and outputs the corresponding results in order. In white-box testing, each program interface is tested with access to the code, based on the design and on the requirements analysis provided by the requirements analyst, using both static and dynamic analysis methods.
Based on a comprehensive understanding of the site's functionality, we performed a long period of testing for content accuracy, image accuracy and click accuracy. PepQSAR also received favorable feedback on implementation effect, button accuracy and content accuracy after the alpha test. The second test did not find any serious errors, and the bugs that were found have been fixed, so PepQSAR could be launched as scheduled. The literature is constantly growing, so a strategy to maintain the PepQSAR database is a necessity. At present, we have settled on manual, periodic literature searches to identify new publications that we can mine for QSAR data. It would be advantageous if journals that publish significant numbers of peptide QSAR studies could be persuaded to make submission to PepQSAR a requirement, as is done for GenBank. We hope that, as PepQSAR becomes more widely used, journals will require or at least strongly encourage authors to submit their research data directly to us.
In this fashion, researchers should be able to dispense with lengthy material searching and the associated data collection, which should significantly speed up new research. Meanwhile, we will keep working to implement more advanced features and rendering effects for the database.

Conclusions
The PepQSAR database is a comprehensive data repository of amino acid descriptors and peptide QSARs with a web-based user interface. The built-in tools for browsing, searching and interactively displaying study data make PepQSAR a useful resource for identifying potential amino acid descriptors and for developing computational tools that predict peptide activity. In addition, we provide free download of the data and categorized files for all education and research users. We plan to update and maintain PepQSAR regularly, as we have done for our previous work. Users are welcome to contact us at any time with questions about PepQSAR. Likewise, we would be extremely grateful to any readers who provide recent data and information on pQSAR studies.