Exhaustive Search Techniques
Now, you may be saying to yourself "since proteins fold themselves
into an optimal conformation in nature, finding the optimal fold
for a protein should be a simple matter of using a computer to try
all possible conformations and then pick out the optimal fold".
Simple in concept and simple to implement as a computer program,
but not simple for even the fastest computers on earth to solve
because there are just too many combinations of conformations to
test. The runtime complexity to fold even a simple abstract protein
only 10 amino acids long is 5**10 comparisons. Given that
each comparison requires a microsecond to complete, 30 years will
be required to find an admisible optimal protein fold.
And that's just a simple abstract protein only 10 amino acids in
Using exhaustive search, the time required to find the optimal
conformation for a decent sized human protein would require more
time than in the current age of the universe. More intelligent methods
must be discovered to break down the complexity of these folding
Enter The Genetic Algorithm
Since we can't expect to use exhaustive search techniques or admissible
heuristics to solve this problem, many research efforts have turned
to the genetic algorithm to find optimal solutions. In this role,
genetic algorithms serve as smart search procedures while also incorporating
a learning mechanism for enriching the search space. And these smart
search techniques have been shown to converge on solutions in far
less time than canonical search techniques.
The problem is that there is no free lunch. Unlike exhaustive search
techniques, genetic algorithms utilize non-admissible heuristics
and thus cannot gaurentee polynomial time solutions... and
even if a GA arrived at the optimal solution, it wouldn't know that
the solution is optimal because it hasn't examined all of the conformations
possible. Gaurenteeing a solution is tantemount to saying that you
know the optimal solution's fitness value ahead of time. But since
fitness relies on structure, you would have to know the structure
ahead of time as well. Clearly this is contradictory.
This is the so-called "termination problem" in GAs, and
is also an unsolved problem.
The goals of my thesis project are thus twofold:
- Reduce the search space so as to find more optimal or near-optimal
solutions. A major difficiency of conventional GA approaches is
that they spend a lot of time searching the erroneous parts of
the solution space. My GA approach eliminates invalid folds so
that more energy is applied to searching spaces that have a greater
likelyhood of containing solutions. The solution space is reduced
in size until an optimal, or set of optimal solutions, is found.
- Solve the termination problem. Clearly we need to know the one
optimal solution (if it exists) when we see it. This is not an
easy problem to solve and is addressed by accepting results that
are within a certain percentage of the theoretical optimal.
The performance goal for this system is to fold a 10,000 amino-acid
chain in low-order polynomial time, making the program a useful
starting point for predicting realistic protein structures. Experiments
conducted in "blocks world domains" where the proteins are only
a few dozen amino acids long are totally unacceptable.
To verify that this scheme works in polynomial time, we will compare
the theoretical worst-case performance of a theoretical exhaustive
search approach against the experimental results obtained by the
project. We will assume that the theoretical model uses atomic operations
that require a microsecond to complete (which is quite generous).
If we assume that one operation requires one microsecond then an
exhaustive search approach should require 5**n search comparisons
where n is the length of the protein in amino acids.
Now we choose a protein string length that would make the folding
experiment intractible for practical purposes. A protein with a
mere 25 amino acids requires 5**25 microseconds or 298023223876.9
seconds. That's about 94,502 years. If the GA approach to protein
folding manages to terminate with the optimal solution (question
- how do we know the optimal solution if we can't use an exhaustive
search technique?) then we know that the system is running within