Background: DNA mimic proteins are relatively obscure control factors that share some similarities with DNA. They emulate the negative charge distribution of the DNA phosphate backbone by using negatively charged amino acids, namely aspartic acid (Asp or D) and glutamic acid (Glu or E). Known DNA mimic proteins control various cellular mechanisms, such as transcription, DNA repair, and gene regulation, by interfering with the binding of DNA to effector proteins. In addition to being functionally important, DNA mimic proteins have potential for use in biotechnological applications. For example, the DNA mimic protein AcrIIA4 from Listeria monocytogenes prophages can improve the accuracy of gene editing by controlling the activity of CRISPR-Cas9. Therefore, DNA mimic proteins warrant further research. However, most DNA mimic proteins cannot be identified using traditional bioinformatics methods owing to their unique amino acid sequences and structural features.
Results: We developed a new protein fingerprint, called the relative distance protein fingerprint (RD-PFP), that can be used to analyze the distribution of amino acids on a protein surface. We optimized the RD-PFP by using machine learning and the characteristic feature of DNA mimic proteins (namely, their DNA-like negative charge distribution) to more accurately predict DNA mimicry from protein sequences.
Conclusion: Our pioneering study contributes to the development of machine learning-based bioinformatics methods for screening DNA mimic proteins.
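
Illustrative sketch: the abstract does not define how the RD-PFP is computed, so the following Python snippet is only a hypothetical illustration of the general idea of a distance-based fingerprint for negatively charged residues. It assumes one 3D coordinate per residue (for example, a C-alpha position), scales each distance between D/E residues by the largest inter-residue distance in the structure, and bins the results into a normalized histogram. The function name, the scaling rule, and the number of bins are all assumptions made for this example and should not be read as the authors' method.

```python
import math
from itertools import combinations

# Negatively charged amino acids named in the abstract: Asp (D) and Glu (E).
NEGATIVE = {"D", "E"}


def relative_distance_histogram(residues, coords, n_bins=10):
    """Hypothetical distance fingerprint for D/E residues.

    residues: string or list of one-letter amino acid codes.
    coords:   list of (x, y, z) tuples, one per residue (assumed input).
    Returns a list of n_bins fractions that sum to 1.0, or all zeros when
    fewer than two D/E residues (or no spatial extent) are present.
    """
    if len(residues) != len(coords):
        raise ValueError("residues and coords must have the same length")

    # Largest distance between any two residues, used as the scale.
    max_d = max((math.dist(a, b) for a, b in combinations(coords, 2)),
                default=0.0)
    if max_d == 0.0:
        return [0.0] * n_bins

    # Coordinates of the negatively charged residues only.
    neg = [c for r, c in zip(residues, coords) if r in NEGATIVE]

    counts = [0] * n_bins
    n_pairs = 0
    for a, b in combinations(neg, 2):
        rel = math.dist(a, b) / max_d                 # relative distance in [0, 1]
        idx = min(int(rel * n_bins), n_bins - 1)      # clamp rel == 1.0 into last bin
        counts[idx] += 1
        n_pairs += 1

    if n_pairs == 0:
        return [0.0] * n_bins
    return [c / n_pairs for c in counts]


# Toy usage with made-up coordinates (not real structural data).
toy_fingerprint = relative_distance_histogram(
    "DAEK", [(0, 0, 0), (1, 0, 0), (2.5, 0, 0), (10, 0, 0)]
)
```

A vector of this kind could then serve as the input features of a classifier; which features and which learning algorithm are actually used in the RD-PFP is not stated in the abstract.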