Beyond ATCG: “Dixel” representations of DNA-protein interactions
Preliminary Studies of the Discriminant Potential of DIXELS
To investigate the capabilities of DIXELS, we started with the simplest task: supervised classification. We gathered a set of E. coli sigma 70 binding sites and a set of control sequences (non-sites) from intergenic regions of convergently transcribed genes and from upstream regions of tandemly transcribed genes. The overall task was to discriminate 29-nucleotide sequences of E. coli DNA that are most likely to be sigma factor binding sites from the control sequences. Classification methods based on sequence alone perform quite well on this task. Specifically, we found that the naïve Bayes approach (NB), which builds two generative models under the assumption that the bases are independent and then applies Bayes' rule (Duda and Hart, 1973), weeded out almost all of the non-sites. However, among the several thousand non-sites, some were always predicted to be sites with high probability. We therefore combined the DIXEL data and the sequence representation in a hybrid procedure that focused on further separating sites from non-sites among all the observations that the sequence-based method (NB) predicted to be sites with high probability.
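The naïve Bayes step described above can be sketched as follows. This is only an illustrative sketch, not the implementation used in the study: the function names, the pseudocount, the equal class prior, and the 6-nucleotide toy sequences are all assumptions made for the example (the real sequences are 29 nucleotides long).

```python
import math

BASES = "ACGT"

def fit_position_model(seqs, pseudocount=1.0):
    """Per-position base log-probabilities, treating positions as independent.

    The pseudocount is an assumption of this sketch; it keeps unseen bases
    from receiving zero probability.
    """
    length = len(seqs[0])
    model = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in seqs:
            counts[s[pos]] += 1
        total = sum(counts.values())
        model.append({b: math.log(c / total) for b, c in counts.items()})
    return model

def log_likelihood(model, seq):
    """Log-probability of a sequence under one generative model."""
    return sum(column[base] for column, base in zip(model, seq))

def posterior_site(site_model, nonsite_model, seq, prior_site=0.5):
    """P(site | seq) obtained from the two generative models by Bayes' rule."""
    log_site = math.log(prior_site) + log_likelihood(site_model, seq)
    log_non = math.log(1.0 - prior_site) + log_likelihood(nonsite_model, seq)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_site, log_non)
    e_site = math.exp(log_site - m)
    e_non = math.exp(log_non - m)
    return e_site / (e_site + e_non)

# Hypothetical toy data, not sequences from the study.
toy_sites = ["TTGACA", "TTGACT", "TTTACA"]
toy_nonsites = ["GCGCGC", "CCGGAT", "AGCGCC"]

site_model = fit_position_model(toy_sites)
nonsite_model = fit_position_model(toy_nonsites)

p_site_like = posterior_site(site_model, nonsite_model, "TTGACA")
p_nonsite_like = posterior_site(site_model, nonsite_model, "GCGCGC")
```

In the hybrid procedure, only observations whose posterior is high would be passed on for further analysis with the dixel variables. With several thousand non-sites and far fewer sites, a smaller `prior_site` would likely be more realistic than the equal prior assumed here.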
To accomplish this, we employed both an exploratory data analysis approach and a data mining approach. We adapted techniques from cheminformatics developed for predicting the bioactivities of small molecules in our prior NSF KDI project DDASSL (Drug Design And Semi-Supervised Learning).
We used Kernel Partial Least Squares regression (KPLS) (Rosipal and Trejo, 2001) to address the dixel variables. KPLS is a member of the family of “kernel” methods that began with Support Vector Machines (Vapnik, 1996), and we first applied it to problems in cheminformatics (Bennett and Embrechts, 2003). Because the sequence-based NB model provides a good first-level representation of TFBS, we orthogonalized the dixel variables with respect to the predictive inference probabilities of NB. KPLS was then employed to compute a function that reduces the residual classification error on the training data. Our preliminary results show that adding the dixel variables EP, bare nuclear potential (BNP), and PIP to the sequence variables holds the most potential to capture higher-order effects and reduce classification error.
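One plausible reading of the orthogonalization step is sketched below: each dixel variable is regressed on the NB probability by ordinary least squares (with an intercept), and only the residual is kept. The text does not specify the exact procedure, so this choice, the function name, and the numbers are assumptions for illustration only. The KPLS fit itself is not shown.

```python
def orthogonalize(column, reference):
    """Residual of `column` after least-squares regression on `reference`.

    An intercept is included, so the residual has zero mean and is
    uncorrelated with `reference`. This residualization is an assumed
    interpretation of "orthogonalized with respect to the NB probabilities".
    """
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_col = sum(column) / n
    centered_ref = [r - mean_ref for r in reference]
    centered_col = [c - mean_col for c in column]
    denom = sum(r * r for r in centered_ref)
    if denom == 0.0:
        # A constant reference carries no information to remove.
        return centered_col
    slope = sum(r * c for r, c in zip(centered_ref, centered_col)) / denom
    return [c - slope * r for r, c in zip(centered_ref, centered_col)]

# Hypothetical values: NB posteriors for five high-probability predictions
# and one dixel descriptor measured on the same five sequences.
nb_prob = [0.91, 0.95, 0.88, 0.99, 0.93]
descriptor = [1.2, 1.9, 0.7, 2.4, 1.1]

residual = orthogonalize(descriptor, nb_prob)
```

The residual columns hold only the part of each descriptor that the NB probability does not already explain in a linear sense, which is the information a subsequent kernel model would then be fitted on.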
DNA/protein binding site identification and quantification is a key component of DNA bioinformatics and gene regulation research. The availability of DIXEL descriptors to translate DNA sequences into chemically-relevant information will provide a data-rich environment for testing machine learning and data mining tools.