Beyond ATCG: “Dixel” representations of DNA-protein interactions
Predicting Target Sites of Transcription Factors
Three broad classes of methods have generally been used to predict the target sites of transcription factors: sequence-based methods, energy-based methods and structure-based methods (Kono and Sarai, 1999). To date, the most successful computational methods for identifying these sites are based on models that represent DNA polymers as sequences of letters. These are often referred to as motif methods because they seek to identify the characteristic sequence patterns, or motifs, of short spans of DNA sequence. Numerous algorithms have been developed to identify motifs from multiple observations, including Gibbs sampling (Lawrence, 1993; Neuwald, 1995), greedy consensus algorithms (Stormo, 1989) and expectation maximization (EM) algorithms (Lawrence, 1990; Cardon, 1992; Bailey, 1994; Lawrence, 1996). In general, the sequence data available to train and validate these methods are quite limited. Because of this limitation, nearly all of these methods keep the number of parameters small by assuming that each base position in a DNA motif contributes an independent term. Some authors have reduced the number of free parameters further, either by exploiting symmetry (Thompson et al., 2003) or through algorithmic steps that focus on the most conserved positions, such as the fragmentation algorithm of Liu et al. (1995).
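The independence assumption described above can be illustrated with a minimal position weight matrix sketch. This is not the implementation of any of the cited methods; the function names (build_pwm, score, best_window), the uniform background, the pseudocount of 1 and the toy sites in the usage note are all assumptions chosen for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log-odds weight matrix from equal-length aligned sites.

    Each column is estimated separately, which is exactly the
    position-independence assumption.
    """
    if background is None:
        # Assumption: uniform background base frequencies.
        background = {b: 0.25 for b in BASES}
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        column = {}
        for b in BASES:
            freq = (counts[b] + pseudocount) / total
            column[b] = math.log2(freq / background[b])
        pwm.append(column)
    return pwm


def score(pwm, window):
    """Additive score: one independent term per position."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_window(pwm, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

For example, build_pwm(["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]) yields a matrix with 7 columns, and best_window applied to "CCCCTGACTCACCCC" selects the window starting at index 4. Because the score is a plain sum over columns, a motif of width w needs only 3w free parameters, which is why such models remain usable when few sites are known.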
At the other extreme, higher-order multibase models have also been employed (Fickett and Hatzigeorgiou, 1997; van Helden et al., 1998; Pavlidis et al., 2001). There is evidence that treating the nucleotides of DNA binding sites as independent is problematic for describing the true binding preferences of TFs (Bulyk et al., 2002). It has been noted that possible interdependence between binding residues should be taken into account and is expected to improve prediction (Mandel-Gutfreund and Margalit, 1998). Although additivity in most cases provides a very good approximation of the true nature of specific DNA-protein interactions (Benos et al., 2002a), a recent study demonstrates that models allowing for interdependence of nucleotides within transcription factor binding sites can indeed improve the sensitivity and specificity of the method (Zhou and Liu, 2004). However, all of these motif modeling efforts are hampered by two major factors: small samples, and an abstract representation of DNA polymers as letters that has little to do with the energetics of protein binding to DNA.
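The tension between interdependence and small samples can be made concrete by counting free parameters. The sketch below compares three generic model families for a motif of a given width; the function name free_parameters and the three families are illustrative assumptions and do not correspond to the specific models of the studies cited above.

```python
def free_parameters(width):
    """Count free parameters of three motif model families.

    Each probability distribution over the 4 bases has 3 free
    parameters, because the 4 probabilities must sum to 1.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    # Independent positions: one distribution per position.
    independent = 3 * width
    # Position-specific first-order dependence: one distribution for the
    # first position, then 4 conditional distributions (one per preceding
    # base) for every later position.
    first_order = 3 + 12 * (width - 1)
    # Fully general model: one probability per possible site, minus 1.
    full_joint = 4 ** width - 1
    return {
        "independent": independent,
        "first_order": first_order,
        "full_joint": full_joint,
    }
```

For a motif of width 10, this gives 30 parameters under independence, 111 with dependence between adjacent positions, and 1,048,575 for an unrestricted joint model. With only a handful of known sites per transcription factor, even the adjacent-dependence model can be difficult to estimate reliably.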
The central hypothesis of the proposed study is that these limitations can be addressed more effectively with a more fundamental characterization of the DNA polymer, specifically through the use of selected electron density properties encoded on the surfaces of the major and minor grooves of the DNA polymer.