Cluster Quality based Non-Reductional (CQNR) oversampling technique and effector protein predictor based on 3D structure (EPP3D) of proteins

Article Type

Research Article

Publication Title

Computers in Biology and Medicine


Background: Bacterial effector proteins infect their hosts through dedicated machinery known as secretion systems. To date, no method for identifying effector proteins from their 3D structure has been reported in the literature. Identifying effector proteins in this way requires extracting features from their 3D structure. However, effector protein datasets are highly imbalanced, and state-of-the-art oversampling algorithms handle such datasets poorly: they usually eliminate some samples as noise, and they do not ensure that synthetic samples are generated strictly in the vicinity of the minority class samples. In effector protein datasets, deleting any sample as noise leads to the loss of crucial information. Furthermore, generating synthetic minority class samples in the vicinity of majority class samples leads to a poorly performing classifier. Method: In this paper, we introduce an algorithm called the Cluster Quality based Non-Reductional (CQNR) oversampling technique. Its novelty lies in generating new samples in proportion to the distribution of the minority class samples, without eliminating any sample as noise. Using CQNR, we develop a novel Effector Protein Predictor based on the 3D structure of proteins (EPP3D). EPP3D is trained on a feature set, balanced by CQNR, comprising 3D structure-based features, namely convex hull layer count, surface atom composition, radius of gyration, packing density and compactness, derived from the 3D structures of experimentally verified effector proteins. Result: F-score and G-mean demonstrate that CQNR outperforms some well-established oversampling methods by approximately 3–5% in classification accuracy on five benchmark datasets and three other highly imbalanced, synthetically generated datasets. Likewise, for the classification of pathogenic effector proteins, an improvement of 7–9% in accuracy is observed when CQNR is applied, followed by EPP3D. Moreover, EPP3D exhibits an improvement of 2–4% in classifying effector proteins based on their 3D structure compared with classification based on their amino acid sequences. The software for CQNR and EPP3D is available at http://projectphd.droppages.com/CQNR.html.
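
The abstract reports results in terms of F-score and G-mean, two measures commonly used for imbalanced classification. The following minimal sketch shows how they are usually computed from binary labels; it assumes the standard definitions (F-score as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity), which may differ in detail from the variants used in the paper. The function name `f_score_and_g_mean` and the toy labels are illustrative only and are not taken from the CQNR or EPP3D software.

```python
import math


def f_score_and_g_mean(y_true, y_pred, positive=1):
    """Return (F-score, G-mean) for binary labels.

    Assumes the standard definitions; `positive` marks the minority class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return f_score, g_mean


# Toy example (hypothetical labels): 2 minority samples among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
f_score, g_mean = f_score_and_g_mean(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f} F-score={f_score:.2f} G-mean={g_mean:.2f}")
```

In this toy case plain accuracy is high because the majority class dominates, while the F-score reflects that only half of the minority samples were recovered, which is why such measures are preferred for highly imbalanced datasets.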



Publication Date