Proteins are both the engines and the building blocks of all living
things, so an understanding of their structure and behavior is
essential to understanding how living things operate. My thesis
project is a computer program designed to predict the three-dimensional
structure of proteins given only their amino acid sequence. This
system is also the first of a family of computer programs whose
purpose is to assist analysts exploring protein structure and function.
The analysis is performed entirely in the digital domain, using
only existing DNA and RNA sequence data, protein homology and characteristics.
How Proteins Fold
All proteins in nature are made up of chains of molecules called
"amino acids". Cells create proteins by "translating" them from
messenger RNA sequences (which are themselves "transcribed" from DNA
sequences). When proteins are translated from RNA they start out as
linear sequences of amino acids. Because the amino acids that make up
a protein have various electrostatic and mechanical properties, the
protein doesn't stay in this unfolded (denatured) form for long and
begins to fold up into a
three-dimensional structure. It is this three-dimensional structure
(as well as the mechanical and electrostatic properties of the amino
acid sequence) that gives the protein its functionality.
For example, two proteins might fold themselves in such a way
that one protein presents a "lock" binding site to the other protein's
corresponding "key". Fitting the key into the lock produces an electrochemical
reaction that performs some essential cellular function.
The translated sequence of amino acids that forms a protein is
called the protein's "primary structure". The local folding patterns
of the chain in three-space, such as helices and sheets, are called
the protein's "secondary structure". The secondary structure of a
protein is determined in large part by the mechanical and
electrostatic effects of neighboring amino acids. Proteins also have
"tertiary" and "quaternary" structures. The tertiary structure refers
to the overall folding path of a protein. For example, a protein
might have a helical secondary structure whereas its tertiary
structure might fold the overall protein into a "supercoil" where the
helical protein coils around itself. The mechanics of how a protein
can fold determine the protein's structure. Tertiary structure
prediction is the hard part and the focus of my thesis project,
although to predict an overall fold, all constraints from local to
global folding must be considered.
The quaternary structure of a protein refers to an assemblage
of multiple protein strings along with the so-called "post-translational
modifications" to the protein strings. "Post-translational modification"
means folding or alterations of the protein string that have occurred
outside of the protein's innate structure or expression. A good example
of a post-translational modification is the addition of a "heme
group" to hemoglobin molecules. Without this heme group, red corpuscles
would be unable to carry oxygen.
One feature of proteins in nature that seems to be very consistent
is that when they do fold, they fold into the lowest-energy structure
available to them; that is to say, the amino acids are at rest and the
protein expends no energy to maintain its structure. This fact
provides us with a key to reliably predicting a protein's structure.
Theoretically, all we have to do is find the optimal conformation
among all the possible conformations a protein can take on.
In practice, however, this is an impractical solution. The amount
of time required to test all possible conformations that a
decent-size protein can take on is far greater than the age of the
universe, even for the fastest computers.
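To give a feel for the scale of this claim, the following Python sketch
works through the arithmetic. It is only a back-of-the-envelope
estimate: the number of states per amino acid, the evaluation rate of
the computer and the function name exhaustive_search_seconds are all
assumptions chosen for illustration, not figures from this project.

```python
# Back-of-the-envelope estimate of the cost of exhaustive conformational search.
# All three constants are illustrative assumptions, not measured values.
STATES_PER_RESIDUE = 3              # assumed backbone states per amino acid
CONFORMATIONS_PER_SECOND = 1e12     # assumed evaluation rate of a very fast computer
AGE_OF_UNIVERSE_SECONDS = 4.35e17   # roughly 13.8 billion years


def exhaustive_search_seconds(n_residues: int) -> float:
    """Seconds needed to evaluate every conformation of an n-residue chain."""
    conformations = STATES_PER_RESIDUE ** n_residues
    return conformations / CONFORMATIONS_PER_SECOND


if __name__ == "__main__":
    for n in (10, 50, 100, 150):
        universe_ages = exhaustive_search_seconds(n) / AGE_OF_UNIVERSE_SECONDS
        print(f"{n:4d} residues: {universe_ages:.3e} ages of the universe")
```

Under these assumptions a 10-residue chain is searched in well under a
second, while a 100-residue chain already exceeds the age of the
universe by many orders of magnitude. Changing the assumed constants
shifts the crossover point but not the exponential growth.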
Hydrophobic Packing Models
One of the properties of amino acids which is thought to determine
most of a protein's resulting structure is the amino acid's "hydrophobicity",
or its aversion to water. This makes sense, because all proteins
fold within a cytoplasmic medium which consists mostly of
water. If one labels each amino acid as "hydrophilic" or "hydrophobic"
and then considers this property as the only mechanism of folding
(while retaining the protein's expected sequential structure) then one
has a macroscopic model for folding abstractions of proteins: hydrophobic
amino acids move towards each other and the protein's "center", away
from the cell's cytoplasm. To further simplify the problem (but
not remove the essential computational complexity of the problem)
we can perform this folding within a discrete Cartesian lattice.
Such abstract models of proteins are termed "Hydrophobic Packing"
models or "HP models" for short and have been investigated by many
researchers, most notably K. A. Dill.
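The abstraction described above can be sketched in a few lines of
Python. This is a minimal illustration, not the program developed in
this thesis. It assumes a two-dimensional square lattice, uses the sign
of the Kyte-Doolittle hydropathy value to label each amino acid H
(hydrophobic) or P (hydrophilic), and scores a fold by counting
hydrophobic pairs that touch on the lattice without being neighbors in
the chain. The function names to_hp and contact_energy are hypothetical.

```python
from __future__ import annotations

# Kyte-Doolittle hydropathy values; a positive value means hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def to_hp(sequence: str) -> str:
    """Reduce a one-letter amino acid sequence to a string of H and P labels.

    Assumption: the threshold at zero is an illustrative choice.
    """
    return "".join("H" if KYTE_DOOLITTLE[aa] > 0 else "P"
                   for aa in sequence.upper())


def contact_energy(hp: str, coords: list[tuple[int, int]]) -> int:
    """Return -1 for every H-H pair adjacent on the lattice but not in the chain."""
    if len(hp) != len(coords):
        raise ValueError("need exactly one lattice site per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("fold is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must occupy adjacent sites")

    residue_at = {site: i for i, site in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if hp[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for site in ((x + 1, y), (x, y + 1)):
            j = residue_at.get(site)
            if j is not None and hp[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


if __name__ == "__main__":
    print(to_hp("MKVLA"))
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    line = [(0, 0), (1, 0), (2, 0), (3, 0)]
    print(contact_energy("HPPH", square), contact_energy("HPPH", line))
```

In this sketch the chain "HPPH" scores lower (better) when bent into a
square, because its two hydrophobic ends come into contact, than when
laid out in a straight line. Other lattices and scoring rules are
possible; the ones here were chosen only to keep the example small.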
These abstracted protein models are no less difficult to solve
computationally. Abstracting the problem just removes the noise
from the problem and allows us to focus on the core difficulties
of predicting protein structure. To this day, both the protein folding
problem and the prediction of abstracted protein folds remain
unsolved.
Even after the protein has been abstracted, the protein folding
problem appears to retain its NP-complete characteristics. This
is good, because we want to remove the background noise from the
problem without removing the constraints that make it difficult to
solve, so that a solution remains scalable to folding real-world
proteins. Since the protein folding problem is generally regarded as
NP-complete, I have discarded from consideration any conventional
problem-solving techniques (such as exhaustive search of the solution
space). Any solution to this problem must fold real-world size
proteins within polynomial time.
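To illustrate why exhaustive search is set aside, the following Python
sketch performs exactly that search on an abstracted chain. It is the
discarded baseline, not the method proposed in this thesis. It assumes
a two-dimensional square lattice and a score equal to the number of
hydrophobic contacts between residues that are not chain neighbors.
The names self_avoiding_walks, hh_contacts and best_fold are
hypothetical, and no symmetry reduction is attempted.

```python
from typing import Iterator, List, Optional, Set, Tuple

Site = Tuple[int, int]
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def self_avoiding_walks(n: int) -> Iterator[List[Site]]:
    """Yield every n-site self-avoiding walk on the square lattice from the origin."""
    def extend(path: List[Site], occupied: Set[Site]) -> Iterator[List[Site]]:
        if len(path) == n:
            yield list(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                yield from extend(path, occupied)
                path.pop()
                occupied.remove(nxt)

    if n >= 1:
        yield from extend([(0, 0)], {(0, 0)})


def hh_contacts(hp: str, path: List[Site]) -> int:
    """Count H-H pairs that are lattice neighbors but not chain neighbors."""
    residue_at = {site: i for i, site in enumerate(path)}
    count = 0
    for i, (x, y) in enumerate(path):
        if hp[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for site in ((x + 1, y), (x, y + 1)):
            j = residue_at.get(site)
            if j is not None and hp[j] == "H" and abs(i - j) > 1:
                count += 1
    return count


def best_fold(hp: str) -> Tuple[int, Optional[List[Site]], int]:
    """Return (most contacts found, one fold achieving it, folds examined)."""
    best, best_path, examined = -1, None, 0
    for path in self_avoiding_walks(len(hp)):
        examined += 1
        contacts = hh_contacts(hp, path)
        if contacts > best:
            best, best_path = contacts, path
    return best, best_path, examined


if __name__ == "__main__":
    for hp in ("HPPH", "HPHPPH", "HPHPPHHPH"):
        contacts, _, examined = best_fold(hp)
        print(f"{hp}: {contacts} contacts after examining {examined} folds")
```

Even on this toy lattice the number of folds examined grows by roughly
a constant factor with each added residue, so the approach is only
usable for very short chains.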