2D-Qsar for 450 types of amino acid induction peptides with a novel substructure pair descriptor having wider scope
© Osoda and Miyano; licensee Chemistry Central Ltd. 2011
Received: 25 February 2011
Accepted: 2 November 2011
Published: 2 November 2011
Quantitative structure-activity relationship (QSAR) analysis of peptides is helpful for designing various types of drugs, such as kinase inhibitors or antigens. Capturing the various properties of peptides is essential for two-dimensional QSAR analysis, and the descriptor used to represent peptides is an important element in capturing those properties. The atom pair holographic (APH) code is designed for describing peptides; it represents a peptide as combinations of thirty-six types of key atoms and the intermediate bindings between two key atoms.
The substructure pair descriptor (SPAD) represents peptides as combinations of forty-nine types of key substructures and the sequence of amino acid residues between two substructures. The key substructures are larger, and the sequences longer, than in traditional descriptors. Similarity searches on a C5a inhibitor data set and a kinase inhibitor data set showed that the ranking of inhibitors became three times higher on each data set when peptides were represented with SPAD. A comparison of the scope of each descriptor shows that SPAD captures properties different from those captured by APH.
QSAR/QSPR for peptides is helpful for designing various types of drugs, such as kinase inhibitors and antigens. SPAD is a novel and powerful descriptor for various types of peptides. Describing peptides with SPAD improves the accuracy of QSAR/QSPR.
Research on the computational classification of small molecules was popular in the 1990s [1–5], with similarity analysis of compounds being a major objective. At the time, there were two main methods for similarity analysis: the fingerprint description approach [4, 6] and the inductive logic programming approach [7–9]. In the fingerprint description approach, a molecule is described as a sequence of bits, each of which indicates the presence of a chemical substructure. Atom-pair descriptors and substructure-type fingerprints are popular descriptors.
Research on the classification of peptides became popular around the year 2000 [10–12]. The hidden Markov model (HMM) approach and the approach of describing peptides with physical data were the major approaches. These papers mainly address the twenty natural amino acids, such as isoleucine and valine. For example, studies of immunity concern peptides composed of the 20 natural amino acids. In traditional research on peptide classification, an amino acid residue was described as a letter of the alphabet or as a set of physical or chemical values.
However, in practical virtual screening, it is necessary to describe other amino acid inductions such as cyclohexyl alanine or F5 phenylalanine. The traditional description of peptides is not sufficiently powerful because it cannot adequately express the characteristics that amino acid residues have in common. For example, tyrosine and phenylalanine share an aromatic ring substructure. In the alphabetic description, tyrosine and phenylalanine are written as 'Y' and 'F', respectively, and a machine learning algorithm cannot infer from the symbols 'Y' and 'F' alone that they share a substructure. Two-dimensional QSAR has been studied for various types of peptides. In the atom-pair holographic code (APH), each peptide is described by a method similar to the atom-pair descriptor. Our novel descriptor, the substructure-pair descriptor (SPAD), captures characteristics of peptides different from those captured by APH and has greater descriptive power than APH. The combination of APH and SPAD may lead to better QSAR for peptides with many types of amino acid inductions.
The Tanimoto coefficient becomes larger as two vectors have more similar bit patterns. When the structures of two compounds are similar, the Tanimoto coefficient is also high.
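As a minimal sketch of how this coefficient behaves on bit vectors, the following Python function uses the usual bit-vector form: the number of bits set in both vectors divided by the number of bits set in at least one. The function name `tanimoto` and the convention of returning 1.0 for two all-zero vectors are assumptions made for illustration, not part of the original text.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length sequences of 0/1 bits."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both vectors
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Assumption: two all-zero vectors are treated as identical.
    return both / either if either else 1.0


if __name__ == "__main__":
    print(tanimoto([1, 1, 0, 0], [1, 1, 0, 0]))  # identical bit patterns
    print(tanimoto([1, 1, 0, 0], [0, 0, 1, 1]))  # no shared bits
```

Identical bit patterns give the maximum value and disjoint patterns give zero, which matches the qualitative statement above.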
In machine learning, excessive features degrade the performance of learning algorithms because of over-fitting. In an excessively large feature space, predictive models lose robustness, so feature selection is necessary for building more accurate predictive models. Kohavi proposed considering the relevance of features instead of maximizing the accuracy of an algorithm. The relevance of features has been discussed for various types of algorithms. Relevance is defined in terms of the difference between the probability density function P(Y = y) and the conditional probability density function P(Y = y | X_i = x_i). When P(Y = y | X_i = x_i) ≠ P(Y = y), X_i is relevant. Otherwise, X_i is irrelevant.
Definition of several terms
In this paper, we define several terms as follows.
Substructure: A part of the structure of a peptide.
Descriptor: A function that maps the structure of an amino acid residue or peptide to a bit according to a substructure.
Feature: A bit produced by a descriptor.
A target protein binds some amino acid residues of a peptide through chemical or physical interactions; hydrogen bonds and the hydrophobic effect are representative examples. In our QSAR approach, we describe the two-dimensional structure of peptides as a sequence of bits and statistically analyze the relationship between a peptide's structure and its activity. When this relationship is analyzed with a data mining algorithm, QSAR rules are extracted automatically from a dataset annotated with the peptides' activity. From a chemical viewpoint, describing the various types of amino acid inductions properly is important for improving QSAR analysis.
From a statistical viewpoint, the best features are those that maximize the accuracy of the algorithm used for QSAR analysis. Kohavi proposed considering the relevance of features instead of maximizing the accuracy of an algorithm. The relevance of features has been discussed for various types of algorithms. Relevance is defined in terms of the difference between the probability density function P(Y = y) and the conditional probability density function P(Y = y | X_i = x_i). When P(Y = y | X_i = x_i) ≠ P(Y = y), X_i is relevant. Otherwise, X_i is irrelevant.
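The relevance condition can be illustrated with empirical frequencies. The sketch below assumes binary features and binary activity labels and estimates P(Y = 1) and P(Y = 1 | X_i = x_i) by counting; the function name `is_relevant` and the tolerance `tol` are hypothetical, and a real analysis would need a statistical test rather than an exact comparison of frequencies.

```python
def is_relevant(feature_bits, labels, tol=1e-12):
    """Return True if the empirical P(Y=1 | X_i=x) differs from P(Y=1) for some x.

    feature_bits: 0/1 values of one feature X_i, one entry per peptide.
    labels:       0/1 activity labels Y, in the same order.
    tol:          hypothetical tolerance for comparing the two frequencies.
    """
    if len(feature_bits) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    p_y = sum(labels) / len(labels)                      # estimate of P(Y = 1)
    for value in (0, 1):
        subset = [y for x, y in zip(feature_bits, labels) if x == value]
        if not subset:                                   # value never observed
            continue
        p_cond = sum(subset) / len(subset)               # estimate of P(Y = 1 | X_i = value)
        if abs(p_cond - p_y) > tol:
            return True
    return False


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    print(is_relevant([1, 1, 0, 0], labels))  # feature tracks the label
    print(is_relevant([1, 0, 1, 0], labels))  # feature carries no information
```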
Definition of the base substructure set for amino acid inductions
1. Describe potential factors for interactions, such as hydrogen bond acceptors.
2. Features of amino acid residues should be only weakly relevant to each other mathematically. This condition avoids strongly relevant features. Abandon features with strong relevance to each other.
3. A feature should have high entropy (in the information-theoretic sense) after the structures of the 450 types of amino acids are mapped to sequences of bits. This condition avoids descriptors that are too specific. Abandon descriptors with low entropy.
The first item is essential for QSAR analysis because key substructures such as hydrogen bond acceptors may be responsible for the activity of a peptide against the target protein. Without a description of them, most QSAR analysis algorithms become powerless. The second and third items are necessary for efficient analysis from a statistical viewpoint. The second item prevents redundancy among features. Even if the structures of two amino acid inductions are chemically different, two features may be relevant to each other, in which case the two features are statistically redundant. The third item is necessary for generating robust QSAR rules, because features with low entropy (in the information-theoretic sense) lose generality.
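The entropy condition can be sketched as follows. The code computes the Shannon entropy of one binary feature over a collection of residues and keeps only features above a cutoff. The names `bit_entropy` and `keep_high_entropy`, the example feature names, and the threshold value are assumptions for illustration; the text does not state the cutoff that was actually used.

```python
import math


def bit_entropy(bits):
    """Shannon entropy (in bits) of a 0/1 feature observed over many residues."""
    if not bits:
        raise ValueError("bits must be non-empty")
    p = sum(bits) / len(bits)            # fraction of residues with the feature set
    if p == 0.0 or p == 1.0:             # constant feature carries no information
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def keep_high_entropy(columns, threshold):
    """Return the names of features whose entropy is at least `threshold`.

    columns:   dict mapping a feature name to its list of 0/1 values.
    threshold: hypothetical cutoff; not specified in the text.
    """
    return [name for name, bits in columns.items()
            if bit_entropy(bits) >= threshold]


if __name__ == "__main__":
    columns = {
        "balanced feature": [1, 0, 1, 0, 1, 0, 1, 0],
        "too specific":     [1, 0, 0, 0, 0, 0, 0, 0],
    }
    print(keep_high_entropy(columns, threshold=0.9))
```

A feature that is set for only one residue out of many has low entropy and is dropped, which corresponds to avoiding descriptors that are too specific.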
The set of substructures Z includes the forty-nine substructures shown in Figure 2. These substructures are roughly grouped into three categories: "The number of atoms", "Substructures", and "Properties". "The number of atoms" indicates how many atoms an amino acid residue contains. "Substructures" indicates whether an amino acid residue has a specific substructure. "Properties" indicates whether an amino acid residue has a certain characteristic from a given viewpoint. For example, the first item of "Properties" describes a structure in which a methylene group and a hydrogen bond acceptor are connected via any atom.
An element z ∈ Z denotes one of the substructures shown in Figure 2. For each z, we define z* as any substructure other than z; in other words, one element z* is defined for each z. The substructure z* is the complement of z because z ∩ z* = ∅ and z ∪ z* = All. We then define the set Z* as the set of all elements z*. Finally, we define the base substructure set X as X = Z ∪ Z*.
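One way to read this construction at the level of bits is sketched below: if a residue is encoded as one bit per substructure z in Z, the bit for z* is taken to be the negation of the bit for z, and the encoding over X is the concatenation of both. This reading of the complement, and the name `base_substructure_bits`, are assumptions for illustration only.

```python
def base_substructure_bits(z_bits):
    """Extend a residue's bits over Z to bits over X = Z ∪ Z*.

    z_bits: list of 0/1 values, one per substructure z in Z.
    Assumption: the bit for z* is 1 exactly when the bit for z is 0.
    """
    z_star_bits = [1 - b for b in z_bits]   # complement bits for Z*
    return list(z_bits) + z_star_bits       # Z first, then Z*


if __name__ == "__main__":
    residue_over_z = [1, 0, 1]              # toy example with |Z| = 3
    print(base_substructure_bits(residue_over_z))
```

With the forty-nine substructures of Z, this encoding has 98 positions per residue.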
Definition of a set of intermediate bindings between any two base substructures
The activity of a peptide is determined not only by the structure of each amino acid residue but also by the relationship among amino acid residues. Here, we define an intermediate binding between two amino acid inductions as the distance between any two base substructures.
The definition of intermediate bindings among base substructures is arbitrary. For example, we could define an intermediate binding among three base substructures. When we describe the relationship among m substructures, the number of combinations is O(n^m), where n is the number of substructures. The number of combinations therefore grows exponentially with m. To avoid this exponential growth, we limited the number of substructures to 2.
Structures of peptides are more flexible than those of small compounds because peptides have many rotatable bonds. Descriptors for peptides should be able to describe this flexibility to obtain high accuracy.
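To make the growth concrete, the short sketch below counts ordered m-tuples of base substructures for a few values of m. It assumes n = 98, that is, the forty-nine substructures of Z plus their forty-nine complements in Z*, and it ignores the additional factor contributed by the choice of intermediate binding; the function name is hypothetical.

```python
def count_tuples(n_substructures, m):
    """Number of ordered m-tuples drawn from n substructures: n ** m."""
    return n_substructures ** m


if __name__ == "__main__":
    n = 2 * 49  # assumption: |X| = |Z| + |Z*| with no overlap
    for m in (2, 3, 4):
        print(m, count_tuples(n, m))
```

Each increment of m multiplies the count by n, which is why restricting the descriptor to pairs keeps the feature space manageable.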
Definition of substructure-pair descriptor
When x_i, x_j and y_k are given, a peptide p_a is converted to a bit by the function F(x_i, y_k, x_j, p_a). Here, we denote the index triple (i, j, k) as b. We then obtain the matrix (M_ab) = (F(x_i, y_k, x_j, p_a)) as the input to the QSAR analysis algorithm. The vector (M_a1, M_a2, ⋯) corresponds to the features of the peptide p_a.
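A minimal sketch of how such a matrix could be assembled is given below. It is not the authors' implementation. It assumes that a peptide is a list of residues, that each residue is the set of indices of the base substructures it contains, and that the intermediate binding y_k is simply a separation of k residue positions along the sequence. The names `spad_bit`, `spad_vector`, and `max_distance` are hypothetical.

```python
from itertools import product


def spad_bit(i, k, j, peptide):
    """F(x_i, y_k, x_j, p_a) under the stated assumptions.

    Returns 1 if some residue at position s contains base substructure i and
    the residue at position s + k contains base substructure j, else 0.
    """
    return int(any(i in peptide[s] and j in peptide[s + k]
                   for s in range(len(peptide) - k)))


def spad_vector(peptide, n_substructures, max_distance):
    """Feature vector of one peptide: one bit per index triple b = (i, j, k)."""
    return [spad_bit(i, k, j, peptide)
            for i, j, k in product(range(n_substructures),
                                   range(n_substructures),
                                   range(max_distance + 1))]


if __name__ == "__main__":
    # Toy peptides: three residues each, with substructure indices 0..2.
    peptides = [
        [{0}, {1}, {0, 2}],
        [{2}, {2}, {1}],
    ]
    # Row a of the matrix M holds the features of peptide p_a.
    M = [spad_vector(p, n_substructures=3, max_distance=2) for p in peptides]
    print(len(M), len(M[0]))
```

Each row of `M` can then be passed to a QSAR analysis algorithm or compared with another row by a similarity measure.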
Results and Discussion
Definition of Datasets
We used two datasets to evaluate the proposed descriptor: C5a inhibitors and kinase inhibitors. Positive data are defined as peptides with high inhibitory potential, and negative data are defined as the other peptides together with peptides of random sequence. The contents of the datasets are as follows.
The number of positive peptides: 116
The number of negative peptides: 451
The number of positive peptides: 24
The number of negative peptides: 325
Difference between SPAD and APH definition
SPAD differs from APH in whether two substructures must be connected directly by an intermediate binding. For example, suppose the main chain is connected to an aromatic ring of a side chain via a carbon chain, and two amino acid residues differ only in the length of that carbon chain. APH distinguishes these two residues, whereas SPAD does not. The structures of the two residues are very similar, so it is natural to expect that their properties are approximately similar; in this case, a descriptor that ignores the difference is preferable. The second difference between SPAD and APH is whether information about properties is included in the descriptor. From the viewpoint of some properties, it may be unnecessary to distinguish between amino acid residues.
Comparison of descriptors correlated highly with peptides' activity
Capturing Area of APH and SPAD in active peptides
Definition of dataset for similarity search with Tanimoto coefficient
Peptides are classified into three categories:
non-active: negative peptides.
active reference: positive peptides which are the basis of similarity search with Tanimoto coefficient.
active: positive peptides except for active reference.
All peptides were sorted in descending order of Tanimoto coefficient.
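The ranking step can be sketched as follows. Because the text does not say how several active reference peptides are combined, the sketch assumes that each peptide is scored by its largest Tanimoto coefficient to any reference; that choice, and the names `tanimoto` and `rank_by_similarity`, are assumptions for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length sequences of 0/1 bits."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def rank_by_similarity(references, candidates):
    """Sort candidate peptides by descending similarity to the references.

    references: list of bit vectors of the active reference peptides.
    candidates: dict mapping a peptide name to its bit vector.
    Assumption: a candidate's score is its maximum Tanimoto coefficient
    over all references.
    """
    scored = [(max(tanimoto(bits, ref) for ref in references), name)
              for name, bits in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # highest score first
    return [name for _, name in scored]


if __name__ == "__main__":
    references = [[1, 1, 0, 0]]
    candidates = {
        "p1": [1, 1, 0, 0],
        "p2": [1, 0, 0, 0],
        "p3": [0, 0, 1, 1],
    }
    print(rank_by_similarity(references, candidates))
```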
Comparison of the performance of SPAD with APH
When the structures of two peptides are similar and a descriptor captures the whole structure or properties of the peptides, the two peptides have similar bit sequences as features. As a result, the Tanimoto coefficient between them is large. Active peptides for a target protein usually have structures similar to each other because they bind the same pocket of the target protein. Therefore, when peptides are described with a descriptor that captures their whole structures or properties, the Tanimoto coefficient between any two active peptides is large.
Conversely, the Tanimoto coefficient between an active peptide and a non-active peptide is smaller because their features differ. However, if we describe peptides with a poor descriptor, the Tanimoto coefficient does not always reflect the similarity of the peptides, because a poor descriptor loses the structural similarity when mapping structures to features. Therefore, the Tanimoto coefficient serves as an indicator of a descriptor's performance.
The graph rises more rapidly when active peptides have larger Tanimoto coefficients than non-active peptides.
In both cases, C5a inhibitors (left panel of Figure 7) and kinase inhibitors (right panel of Figure 7), the curve for SPAD lies above the curve for APH. The enrichment factor with SPAD is higher than with APH at any percentage of active peptides. Therefore, SPAD maps similar structures to similar features more precisely than APH, which means that SPAD performs better than APH in analyzing peptides' activity.
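For readers who want to reproduce this kind of comparison, the sketch below computes an enrichment factor from a ranked list. It uses a common definition, the fraction of actives in the top part of the ranking divided by the fraction of actives in the whole list; the text does not state the exact definition used for Figure 7, so the definition, the function name, and the parameter `top_fraction` are assumptions.

```python
def enrichment_factor(ranked_labels, top_fraction):
    """Enrichment factor of a ranked list at a given top fraction.

    ranked_labels: 0/1 activity labels, ordered from most to least similar.
    top_fraction:  fraction of the list to inspect, with 0 < top_fraction <= 1.
    """
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    n = len(ranked_labels)
    total_actives = sum(ranked_labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need a non-empty list with at least one active")
    n_top = max(1, round(n * top_fraction))       # size of the inspected top part
    hits = sum(ranked_labels[:n_top])             # actives found in the top part
    return (hits / n_top) / (total_actives / n)


if __name__ == "__main__":
    good_ranking = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # actives ranked first
    poor_ranking = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]   # actives scattered
    print(enrichment_factor(good_ranking, 0.2))
    print(enrichment_factor(poor_ranking, 0.2))
```

A descriptor that places active peptides near the top of the ranking yields a larger value than one that scatters them.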
Two-dimensional QSAR of peptides, which are sequences drawn from 450 types of amino acid inductions, requires descriptors that capture various properties. The atom pair holographic code and our proposed substructure pair descriptor are such descriptors. APH captures the internal characteristics of an amino acid induction, whereas SPAD captures the relationship between two amino acid inductions. SPAD captures much more information for QSAR of peptides than APH and distinguishes active peptides from non-active peptides more accurately.
- Jain AN, Dietterich TG, Lathrop RH: Compass: A shape-based machine learning tool for drug design. Journal of Computer-Aided Molecular Design. 1994, 8 (6): 635-652. 10.1007/BF00124012.
- Nielsen H, Brunak S: Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Engineering. 1999, 12: 3-9. 10.1093/protein/12.1.3.
- Carhart RE, Smith DH: Atom pairs as molecular features in structure-activity studies: definition and applications. J Chem Inf Comput Sci. 1985, 25 (2): 64-73. 10.1021/ci00046a002.
- Sheridan RP, Miller MD: Chemical Similarity Using Geometric Atom Pair Descriptors. J Chem Inf Comput Sci. 1996, 36: 128-136. 10.1021/ci950275b.
- Nilakantan R, Bauman N, Dixon JS: Topological torsion: a new molecular descriptor for SAR applications. Comparison with other descriptors. J Chem Inf Comput Sci. 1987, 27 (2): 82-85. 10.1021/ci00054a008.
- Helguera AM, Combes RD: Applications of 2D Descriptors in Drug Design: A DRAGON Tale. Current Topics in Medicinal Chemistry. 2008, 8 (18): 1628-1655. 10.2174/156802608786786598.
- King RD, Muggleton SH: Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proceedings of the National Academy of Sciences. 1996, 93: 438-442. 10.1073/pnas.93.1.438.
- King RD, Srinivasan A: Prediction of rodent carcinogenicity bioassays from molecular structure using inductive logic programming. Environ Health Perspect. 1996, 104 (5): 1031-1040. 10.1289/ehp.96104s51031.
- Finn P, Muggleton S, Page D: Pharmacophore Discovery Using the Inductive Logic Programming System PROGOL. Machine Learning. 1998, 30 (2-3): 241-270.
- Nielsen H: Predicting Protein-Peptide Binding Affinity by Learning Peptide-Peptide Distance Functions. Protein Engineering. 1999, 3-9. 12
- Majeux N, Udaka K, Mamitsuka H: Prediction of MHC Class I Binding Peptides Using an Ensemble Learning Approach. Genome Informatics. 2003, 14: 687-688.
- Udaka K, Mamitsuka H, Abe N: Prediction of MHC Class I Binding Peptides by a Query Learning Algorithm Based on Hidden Markov Models. Journal of Biological Physics. 2002, 28 (2): 183-194. 10.1023/A:1019931731519.
- Tian F, Zhou P: A novel atom-pair hologram (APH) and its application in peptide QSARs. Journal of Molecular Structure. 2007, 871 (1-3): 140-148. 10.1016/j.molstruc.2007.02.012.
- Ahmed HE, Vogt M: Design and Evaluation of Bonded Atom Pair Descriptors. J Chem Inf Model. 2010, 50 (4): 487-499. 10.1021/ci900512g.
- Rogers DJ, Tanimoto T: A Computer Program for Classifying Plants. Science. 1960, 132 (3434): 1115-1118.
- Willett P: Similarity-based virtual screening using 2D fingerprints. Drug Discovery Today. 2006, 11 (23-24): 1046-1053. 10.1016/j.drudis.2006.10.005.
- Akaike H: Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory. 1973, Akademiai Kiado, Budapest, Hungary, 1: 267-281.
- Kohavi R, John GH: Wrappers for feature subset selection. Artificial Intelligence. 1997, 97 (1-2): 273-324. 10.1016/S0004-3702(97)00043-X.
- Zhao Z, Liu H: Spectral feature selection for supervised and unsupervised learning. ICML '07: Proceedings of the 24th International Conference on Machine Learning. 2007, ACM: New York, NY, USA. ISBN: 978-1-59593-793-3.
- Shannon CE: A mathematical theory of communication. Bell System Technical Journal. 1948, 27: 379-423.
- C5a inhibitors [WO/2006/074964].
- Kinase inhibitors [WO/2003/059942].
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.