Statistical Models for Protein Folding Pathways
Co-Investigator: Chris Bystroff
Associate Professor, Department of Biology, Rensselaer Polytechnic Institute
Protein folding will only be fully understood when we can accurately model the physical process of folding -- the folding pathway. Folding pathways have evolved along with proteins, and evidence of their kinetic intermediate states is preserved in the database of protein crystal structures. Hypothesis-driven database mining and modeling experiments can find these intermediates. Each experiment is a statistical test of a hypothesis about the nature of folding. The result of the experiment is a predictive model that can be tested by cross-validation, blind predictions, simulations, and laboratory experiments.
We are developing five hierarchical statistical models that represent progress toward a comprehensive statistical model for folding pathways. The models are built on the principle that a non-redundant data set of structures obeys Boltzmann statistics: the frequency with which a structural state appears in the data reflects its free energy, with lower-energy states more heavily populated. A recurring motif in the data therefore implies an energetic selection pressure, and hence evolutionary conservation. Care is taken to account for missing data and redundancy at each hierarchical level. We have shown that no sparse-data limit is encountered when statistical modeling is done hierarchically. The combined five models represent energetic preferences for the next step along the pathway given the structures from the previous step -- a decision tree in sequence-structure space where each model feeds into the next, as follows:
- Initiation. Short motif sequences that fold independently.
- Propagation. Local extension of the structured region, building on initiation sites.
- Condensation. Contacts form between pairs of pre-folded local structure units.
- Packing. Groups of pre-folded units pack in a space-filling way, independent of their sequential ordering.
- Topology. Loop lengths and the sequential ordering of pre-formed groups constrain the possible fully folded topologies.
Both native substructures and non-native, off-pathway intermediates are part of the model at each stage. Experimental validation of model predictions is underway using designed proteins.
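The Boltzmann assumption underlying the models can be illustrated with a minimal sketch of the standard inverse-Boltzmann conversion, in which observed frequencies are turned into pseudo-free-energies relative to a reference distribution. This is not the project's own code: the function name, the state labels, the counts, the uniform reference distribution, and the pseudocount are all hypothetical choices made for illustration.

```python
import math

# Molar gas constant in kcal/(mol*K), so energies come out in kcal/mol.
R_KCAL = 0.0019872041


def pseudo_energies(observed_counts, reference_probs,
                    temperature=298.0, pseudocount=1.0):
    """Inverse Boltzmann: dG(s) = -RT * ln(p_obs(s) / p_ref(s)).

    observed_counts: state -> count in a non-redundant data set (hypothetical).
    reference_probs: state -> expected probability with no energetic preference.
    pseudocount: added to every count so unseen states get a finite energy
                 (an assumption of this sketch, not taken from the source).
    """
    states = list(reference_probs)
    total = sum(observed_counts.get(s, 0) + pseudocount for s in states)
    rt = R_KCAL * temperature
    energies = {}
    for s in states:
        p_obs = (observed_counts.get(s, 0) + pseudocount) / total
        energies[s] = -rt * math.log(p_obs / reference_probs[s])
    return energies


# Made-up counts for three local structure states.
counts = {"helix": 600, "strand": 300, "turn": 100}
reference = {"helix": 1 / 3, "strand": 1 / 3, "turn": 1 / 3}
energies = pseudo_energies(counts, reference)

for state, dg in sorted(energies.items(), key=lambda kv: kv[1]):
    print(f"{state:>6s}  {dg:+.3f} kcal/mol")
```

States observed more often than the reference predicts receive negative pseudo-energies, and rarer states receive positive ones. In a hierarchical scheme, a table like this could in principle be estimated separately at each stage, conditioned on the structures produced by the previous stage.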