What is this research thing that I'm always saying I should be doing? Well, it involves designing sequences for protein structures.
Proteins are made up of long chains of amino acids that fold into 3D structures. The shape of a protein helps it bind to other molecules and thus perform functions in our bodies.
Sequence design is about creating a sequence from a structure. It is useful in drug design and exploring the sequence-structure relationship, among other things.
As shown above, for every structure we want to design a family of sequences. From this family we gather information about which amino acids, at which positions, are important to the structure.
The number of possible sequences is very large: 20^N, where N is the number of amino acids in the chain and 20 is the number of different amino acid types. However, the number of distinct protein structures is not that big. We want to find out more about which of the 20^N possible sequences would fold to a given structure and which would not.
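To get a feel for how fast 20^N grows, here is a quick sketch. The function name and the chain lengths are just picked for illustration. Python integers have arbitrary precision, so the counts are exact.

```python
NUM_AMINO_ACID_TYPES = 20  # the 20 amino acid types mentioned above


def sequence_space_size(chain_length: int) -> int:
    """Number of distinct sequences for a chain of the given length: 20^N."""
    return NUM_AMINO_ACID_TYPES ** chain_length


# Print how many decimal digits 20^N has for a few illustrative chain lengths.
for n in (10, 50, 100):
    digits = len(str(sequence_space_size(n)))
    print(f"N = {n}: 20^N has {digits} digits")
```

Even a modest 100-residue chain gives a number with 131 digits, so trying every sequence is out of the question.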
Because we don't want to test all 20^N possible sequences, we use a directed search. It is based on the Metropolis criterion, which lets us take a random walk through the possible sequences to find one that has a high probability of folding to the desired structure. This probability is based on a computed sequence-structure compatibility that depends on a lot of things. Getting that compatibility right is the critical part.
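Here is a minimal sketch of what a Metropolis-style walk over sequences can look like. Everything specific in it is an assumption made for illustration: the one-mutation-at-a-time move, the fixed temperature, and especially `toy_energy`, which just counts mismatches against a made-up target string and stands in for the real sequence-structure compatibility.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 types


def toy_energy(sequence, target):
    """Hypothetical stand-in score: mismatches against a target (lower is better)."""
    return sum(a != b for a, b in zip(sequence, target))


def metropolis_search(energy, length, steps, temperature, rng):
    """Random walk over sequences, accepting moves by the Metropolis criterion."""
    current = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    current_e = energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        # Propose a single-position mutation.
        pos = rng.randrange(length)
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)
        new_e = energy(current)
        delta = new_e - current_e
        # Always accept improvements; accept worse moves with prob exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_e = new_e
            if current_e < best_e:
                best, best_e = list(current), current_e
        else:
            current[pos] = old  # reject: undo the mutation
    return "".join(best), best_e


target = "MKTAYIAKQR"  # made-up target, only for the toy score
best, best_e = metropolis_search(
    lambda seq: toy_energy(seq, target),
    length=len(target), steps=5000, temperature=0.5, rng=random.Random(0),
)
print(best, best_e)
```

The occasional acceptance of a worse sequence is what keeps the walk from getting stuck at the first local optimum it finds.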
The stuff I described above is mostly old news. What I've done is improve the performance of the algorithm, both in runtime and in the quality of results.
Our cells have water in them?
Proteins interact with the stuff around them, which is mostly water. It's important to understand this interaction to get any sort of useful results.
Some of the protein's atoms like water and some don't. When a protein folds, the atoms that don't like water tend toward the inside, while those that like water head to the outside. If we propose a model for a protein we need to make sure it follows this pattern. Quantifying this, quickly, is the goal. We use an implicit solvent model to approximate the water's effect instead of dealing with every water molecule individually.
We start by modeling each atom in a protein as a sphere and consider the surface areas of the spheres covered by the different atoms. But what do you do with the parts of the surface that are covered by more than one atom? How do you split it?
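The easy case is a single neighbor: the part of sphere i that sits inside sphere j is a spherical cap, and its area has a closed form, 2*pi*r_i*h, where h is the cap height. The sketch below works that out. The function name and the example radii are assumptions for illustration, not taken from the actual code.

```python
import math


def covered_cap_area(r_i, r_j, d):
    """Area of sphere i's surface lying inside sphere j, centers a distance d apart."""
    if d >= r_i + r_j:
        return 0.0  # spheres do not overlap
    if d + r_i <= r_j:
        return 4.0 * math.pi * r_i ** 2  # sphere i is entirely inside sphere j
    if d + r_j <= r_i:
        return 0.0  # sphere j is entirely inside sphere i and touches none of its surface
    # Distance from center i to the plane where the two spheres intersect.
    plane = (d * d + r_i * r_i - r_j * r_j) / (2.0 * d)
    height = r_i - plane  # height of the cap facing sphere j
    return 2.0 * math.pi * r_i * height


# Two unit spheres one unit apart: each covers a cap of area pi on the other.
print(covered_cap_area(1.0, 1.0, 1.0))
```

With two or more neighbors, just adding up these caps counts the overlapping regions more than once, which is exactly the splitting problem raised above.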
To make a long story short, we use the Delaunay and Voronoi diagrams. Why? Because they provide an intuitive solution. They've also been studied pretty extensively. It's that whole "standing on the shoulders of giants" thing.
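To show the flavor of a Voronoi-style split, here is a rough numerical sketch. It is not the exact computation described here: it samples points on one sphere and hands each covered point to the covering neighbor with the smallest power distance (squared distance minus squared radius), which is the weighted Voronoi rule. Using the power distance, the Fibonacci sampling, and all the names are assumptions on my part.

```python
import math


def sphere_points(n):
    """Roughly evenly spread points on the unit sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        ring = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        points.append((ring * math.cos(theta), ring * math.sin(theta), z))
    return points


def power_distance(point, center, radius):
    """Squared distance to the center minus squared radius; negative means inside."""
    return sum((p - c) ** 2 for p, c in zip(point, center)) - radius ** 2


def split_covered_area(center, radius, neighbours, n=4000):
    """Estimate (area assigned to each neighbour, exposed area) for one sphere.

    neighbours is a list of (center, radius) pairs.
    """
    counts = [0] * len(neighbours)
    exposed = 0
    for unit in sphere_points(n):
        point = tuple(c + radius * u for c, u in zip(center, unit))
        powers = [power_distance(point, c, r) for c, r in neighbours]
        covering = [i for i, pw in enumerate(powers) if pw < 0.0]
        if not covering:
            exposed += 1
        else:
            # Each covered point goes to exactly one neighbour: no double counting.
            owner = min(covering, key=lambda i: powers[i])
            counts[owner] += 1
    patch = 4.0 * math.pi * radius ** 2 / n  # area represented by one sample point
    return [k * patch for k in counts], exposed * patch


shares, exposed = split_covered_area(
    (0.0, 0.0, 0.0), 1.0,
    [((0.6, 0.0, 0.8), 1.0), ((-0.6, 0.0, 0.8), 1.0)],
)
print(shares, exposed)
```

Because every sample point ends up in exactly one bucket, the shares plus the exposed area always add back up to the full sphere. The diagrams give the same kind of clean partition without the sampling error.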
Once you have all this information you need to actually compute the areas. While computing these portions of the surface area of a sphere may seem like an easy problem, it has never been clearly spelled out before.
With all the surface areas you can then weight them depending on what atoms are involved and get your answers.
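As a sketch of that last step, the weighting can be as simple as a sum of area times a per-atom-type coefficient. The atom type labels, the coefficient values, and the areas below are all made up for illustration. Real coefficients would come from whatever implicit solvent parameterization is in use.

```python
def weighted_surface_score(areas, atom_types, weights):
    """Sum of each atom's surface area times the weight for its atom type."""
    return sum(weights[kind] * area for area, kind in zip(areas, atom_types))


# Hypothetical weights: positive penalizes exposing an atom type to water,
# negative rewards it. These numbers are placeholders, not fitted values.
example_weights = {"C": 1.0, "N": -0.5, "O": -0.5}

# Made-up exposed areas for three atoms.
example_areas = [10.0, 4.0, 6.0]
example_types = ["C", "N", "O"]

print(weighted_surface_score(example_areas, example_types, example_weights))
```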
Of course the hard work is in the details.