Beyond ATCG: “Dixel” representations of DNA-protein interactions
Principal Investigator: Curt M. Breneman
Co-Investigator: N. Sukumar
Department of Chemistry and Chemical Biology and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute
In April 2003, the sequence of the human genome was completed, and numerous other genomes have been and are now being sequenced. Although these are significant achievements, much remains to be done. While reasonable progress has been made toward finding the identities and locations of genes within the data, the identities of other functional elements encoded in the DNA sequence - such as promoters and other transcriptional regulatory sequences - remain largely unknown. The sequence-specific binding of various proteins to DNA is perhaps the most fundamental process in the utilization of these other functional elements encoded in the DNA. For example, transcription regulation, which is achieved primarily through the sequence-specific binding of transcription factors to DNA, is arguably the most important foundation of cellular function, since it exerts the most fundamental control over the abundance of virtually all of a cell’s functional macromolecules. Because of this fundamental role, the study of transcription regulation will be critical to our understanding and eventual control of growth, development, evolution and disease. As part of this proposal, we seek support to develop improved computational technologies for the identification of transcription factor binding sites (TFBS) in DNA through cheminformatic techniques and to develop a framework for generating a broad molecular understanding of the selectivity of binding of such regulatory elements to specific DNA sequences.
Three broad classes of methods have generally been used for predicting target sites of transcription factors: sequence-based methods, energy-based methods and structure-based methods (Kono and Sarai, 1999). To date, the most successful computational methods for the identification of these sites are based on models that represent DNA polymers as sequences of letters. These are often referred to as motif methods because they seek to identify the characteristic sequence patterns, or motifs, of short spans of DNA sequence. Numerous algorithms have been developed to identify motifs from multiple observations, including Gibbs sampling (Lawrence, 1993; Neuwald, 1995), greedy consensus algorithms (Stormo, 1989) and expectation maximization (EM) algorithms (Lawrence, 1990; Cardon, 1992; Bailey, 1994; Lawrence, 1996). In general, the sequence data available to train and/or validate these methods are quite limited. Because of these data limitations, nearly all of these methods employ models with relatively few parameters by assuming that each base position in a DNA motif contributes an independent term. In fact, some authors have developed computational methods that further reduce the number of free parameters by exploiting symmetry (Thompson et al., 2003) or via algorithmic steps that focus on the most conserved positions, such as the fragmentation algorithm of Liu et al. (1995).
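To make the independence assumption concrete, the following is a minimal Python sketch of a letter-based motif model in which every position contributes its own log-odds term and the score of a candidate site is the sum of those terms. It is an illustration only, not the method of any work cited here: the function names (build_pwm, score, best_hit), the pseudocount of 1, the uniform background frequencies and the example sites are all assumptions introduced for this sketch.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build per-position log-odds scores from aligned sites of equal length.

    Each column is estimated on its own, which is the independence
    assumption: no term couples two positions.
    """
    if background is None:
        # Assumed uniform background; real genomes are usually not uniform.
        background = {base: 0.25 for base in ALPHABET}
    width = len(sites[0])
    pwm = []
    for i in range(width):
        # Start from the pseudocount so unseen bases do not give log(0).
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({
            base: math.log2((counts[base] / total) / background[base])
            for base in ALPHABET
        })
    return pwm

def score(pwm, window):
    """Additive score: the sum of independent per-position terms."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    )
```

With L aligned positions, this model has only 3L free parameters (three independent frequencies per column), which is why it remains usable when few example sites are available. For instance, calling build_pwm on a handful of aligned hexamers and then best_hit on a longer sequence returns the start index of the window that best matches the column-wise preferences.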
At the other extreme, higher-order multibase models have also been employed (Fickett and Hatzigeorgiou, 1997; van Helden et al., 1998; Pavlidis et al., 2001). There is evidence that treating the nucleotides of DNA binding sites as independent is problematic when describing the true binding preferences of transcription factors (Bulyk et al., 2002). It has been noted that possible interdependence between binding residues should be taken into account and is expected to improve prediction (Mandel-Gutfreund and Margalit, 1998). Although additivity provides in most cases a very good approximation of the true nature of specific DNA-protein interactions (Benos et al., 2002a), a recent study demonstrates that models allowing for interdependence of nucleotides within transcription factor binding sites can indeed improve the sensitivity and specificity of the method (Zhou and Liu, 2004). However, all of these motif modeling efforts are hampered by two major factors: small samples and an abstract representation of DNA polymers as letters that has little to do with the energetics of the binding of proteins to DNA.
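One simple way to ask whether two positions in a set of aligned sites behave independently is to compute the mutual information between them. The sketch below is a generic illustration and is not the procedure used in any of the studies cited above; the function name mutual_information is hypothetical. It uses plug-in frequency estimates, which are known to be biased upward for small samples, so with the limited data typical of binding-site collections the value should be read with caution.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (in bits) between positions i and j of aligned sites.

    A value near 0 is consistent with the two positions being independent;
    larger values indicate that the base at one position is informative
    about the base at the other.
    """
    n = len(sites)
    # Observed counts of single bases at each position and of base pairs.
    count_i = Counter(site[i] for site in sites)
    count_j = Counter(site[j] for site in sites)
    count_ij = Counter((site[i], site[j]) for site in sites)

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        # Only observed pairs contribute; unobserved pairs add zero.
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi
```

A full pairwise model over L positions needs on the order of 15 free parameters per position pair rather than 3 per position, which illustrates why the small-sample problem noted above becomes more severe once interdependence is modeled.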
The central hypothesis of the proposed study is that these limitations can be more effectively addressed using a more fundamental characterization of the DNA polymer, specifically through the use of selected electron density properties encoded on the surfaces of the major and minor grooves of the DNA polymer.