Outlined below is a diagram and step-by-step descriptions of how Homology Modeling (HM) runs. HM is the easiest tool to run in CAD, but it performs a complex set of functions. You can run HM by inputting a sequence and come back later to get your structures. In the meantime, our software is running an elaborate search for templates, running thousands of modeling steps, then curating the best results after repeating the process 200 times.
You may find it helpful to read the HM diagram alongside the step-by-step description below.
STEP-BY-STEP DESCRIPTION OF
HOMOLOGY MODELING
TEMPLATE SEARCH (RUNS ONCE):
1a) Identifying Distant Homologs in the PDB with HHsearch and SparksX:
When your sequence is submitted for an HM job, HM finds homologs in the PDB to use as templates for structure prediction.
HM creates a “profile” for your input sequence. This is far more informative than a simple BLAST. We use HHsearch and SparksX which makes multiple sequence alignments that will find distant homology information for the profile. Profiles are also created for structures in the PDB. The profile of your target will be compared to all PDB profiles to find matches. The profiles contain information about the protein in evolutionary terms. A protein may have a region prone to gaps, a region most commonly predicted to be a helix, or have a set of likely mutations. These types of features are compared to PDB profiles to find the most hits and perform the alignment.
Your sequence will be aligned to template PDB based on their profiles. The alignment is based on evolutionary information inferred from the profile so cannot be created with only your target and template sequence. For example, below is a sequence which was aligned to a template using the HHsearch profile. This looks significantly different that the alignment formed by CLUSTAL, when it aligns the 2 sequences without the profile information. Creating a model based on the CLUSTAL alignment would look vastly different and be inaccurate.
1b) Weighted Sequence Alignments
Each of these alignments is given a rank based on how close it’s profile is to your input sequence’s profile. The best alignment will have the highest weight, making it the structure with the highest probability to be used as the initial conformation for modeling, and the next best alignments will be given lower weights. The initial conformation will bias modeling towards that original structure. So weighting results is biasing the whole modeling process even though it is only used during this step. Overall, the top two or three alignments for HHsearch and SparksX will have the strongest influence on the final structures generated.
Once your input sequence has been aligned to the sequences of known structures in the PDB, we take each homologous structure and swap your sequence in place of the template according to the alignment. This method is called threading and is described next.
1c) Threading:
HM will begin modeling your predicted structure by threading the backbone residues of your input sequence over the structure of the homologous templates. This structure will have the identical backbone as the template, but the side chains will be your sequence. Regions in the template that are not aligned are not included in the structure. These missing sections will be discussed below.
Side chain atoms will be replaced with one representative atom called a “centroid”. A centroid atom has the general size and chemistry of the whole side chain, but it is a single atom that doesn’t exist in nature. It’s purely used for computational ease. We fit together chunks from other proteins with centroids from your sequence to see which chunks fit together best. We will fine tune the details later. This method makes it easier to get the general positioning of all segments without exhaustive searching that could literally take a year of computational processing.
Your input sequence will likely not align with the entire template structure for all of the templates used for modeling. However, this is not a problem because HM will be sampling only small sections of each template at a time. Since the modeling process involves swapping small section from templates until it finds the best combination, each template does not need to have full coverage.
Threaded models with centroid side chains for all templates (up to 30) are created and they are all globally aligned. These will be used in three ways. First, one of the threaded templates will be used as a starting conformation for modeling (missing regions will be added in this case). Second, these models are used as distance constraints to score conformational changes during modeling. Thirdly, subregions of these models will be swapped into the structure in order to sample all the templates for partial fit. These three methods are described below.
1d) Template Fragments:
As we just mentioned, small regions from the templates are swapped into the structure in order to sample regionally changes. These are the templates that can match large regions of your sequence, but only small bits are used for fragment sampling. This allows the nature of all templates to affect the final structure. However, the templates whose chemistry matches the sequence best will be more likely to be accepted during sampling. So poorly matching templates usually don’t skew results in the wrong direction.
Each template will be subdivided based on secondary structure. HM uses DSSP (Define Secondary Structure of Proteins) to detect secondary structure elements in the PDBs. Then we subdivide the structure based on the following rules:
- A helical region of 6 residues or longer will be considered one fragment from the template.
- A beta sheet of 3 residues or longer is a fragment.
- A loop will be split in half and included in the fragment proximal to it.
- If a loop is under 3 residues, the loop and its upstream and downstream structures will be treated as one fragment. This increases the probability that a small, structural motif like a beta-hairpin or a kinked helix will be one fragment.
2) Ab Initio Fragment Selection
We have a large library of fragments from the PDB that are three to nine residues long. The fragments contain a small section of a helix, sheet, or loop from known structures. Though not comprehensive, these represent a wide range of potential conformations that can be sampled during modeling. We use a sliding scale along the input sequence to select fragments from the library that are similar by sequence or secondary structure prediction.
These fragments, termed Ab Initio fragments, lack the homology relationship in an evolutionary sense, but are an excellent way to help find favorable conformations in the gaps where an evolutionary homology cannot be found.
MODELING (RUNS 200 TIMES)
STAGE ONE (Building Frankenstein’s Monster):
In Stage One of modeling, we start with a random template structure and sample fragments from other templates and Ab Initio fragments to model conformational change. This stage makes 10,000 structural sampling moves in a Monte Carlo approach with Metropolis Criteria (MCMC). It uses a simplified scoring method at each move to allow course-grain optimization biased by homologous structures. This allows fragment ends to have some-what poor geometry because these will be optimized in later stager. However, this is known to be an efficient way to fit together fragments of the best templates even with low homology.
3) Randomly Choose One Threaded Model to Start with:
At the start of the modeling process one threaded model will be randomly chosen as the starting structure. While the threaded model may only cover part of the input sequence, other fragments will be chosen at random to fill in the gaps. The gaps are filled in using Ab Initio fragments only, not fragments from other templates. For example, if the randomly chosen template only aligned to 50% of the protein, the remaining regions will have Ab Initio fragments between 3 and 9 residues long. See below.
The full length protein is analogous to Frankenstein’s monster which was pieced together from the parts of many people. However, we will be doing a lot more trial and error to mix together potential protein chunks to find the best combination. This initial structure will often appear as non-native as Frankenstein’s monster appears inhuman. However, it will look much different after Stage One of modeling iteratively swaps fragments across the entire input sequence.
The template weights come into play when creating the starting structure. The top templates for HHsearch and SparksX are both weighted highest so are the most likely structures to become the initial structure. Then their second best hits are the next most likely and so on. Since the whole HM modeling process is done 200 times with the same input, the major difference between the individual runs is the identity of the initial structure.
The two largest weighted structures have ~20% probability each to be chosen as the initial input structure. So on average, the two top hits will be the starting structure for 80 of the 200 models. The next two have ~10% probability (40 of 200 models on average), then 4% and so on. For each of the 200 runs, the starting structure is created with a random template with gaps filled in with Ab Initio fragments. The remaining threaded templates are aligned to the starting template in order to find a good orientation between the initial pose and the templates whose fragments will be swapped in. Therefore, templates that have no sequence overlap with the starting template (meaning they could only align with gapped regions which are filled in with Ab Initio fragments) will not be used for this HM run. In this way, all 200 modeling runs will have a slightly different, though overlapping set of templates.
Alternate Approach: Choosing your own templates
We will soon be implementing the ability to choose your own structures to be used as templates during structure prediction. In this case, you will upload your structure from your desktop or straight from the PDB into Bench. Then indicate that it will be included for structure prediction with HM. In this case, your sequence will be thread onto the input structure and will be used as the starting conformation for modeling. If there are regions of your sequence that do not align with your starting template, the initial structure for the gapped regions will be filled in with random Ab Initio fragments.
If you enter multiple input templates, they will all be equally weighted and have an equal probability of being the initial conformation for any of the 200 repeats of the modeling protocol. Our automated approach will still look for additional structures to use as templates to guide conformational sampling during the HM run. Fragments from those templates will be swapped into your structure as described later in this document.
Template selection can be incredibly useful in many ways. For example:
Proprietary Structures
Many companies have proprietary crystal structures for their protein targets that they can’t add to the PDB. However, these could be very useful when trying to run HM with a similar protein. A user can load the proprietary structure into their Bench project and be assured that it is secure. There will be no outside access to the proprietary structure. So this and other structures can be used as an HM template to create a far superior model than without the proprietary structure, since HM success is strongly dependent on detection of structural homologs.
Preferred Structural Variants
Many HM runs find a large number of PDB hits that make good templates due to high coverage and homology. However, the hits can have multiple conformations due to conditions, such an enzyme which can have an Open or Closed state. Choosing the preferred state is now possible by specifically designating a template. The alternate templates are likely to be included for modeling. However, automated template selections are weighted far less than your custom template selection. So we will sample fragments of the homolog rather than use it as a starting structure. So your template will bias the final structure far more than the others.
Bias Modeling Towards Conformations with a Binding Partner
Our automated HM tool will eventually allow addition of ligands or another protein during an HM run. This is often necessary to achieve the conformational changes that only occur in the presence of a binding partner. However, an imperfect approximation of the bound state can be achieved by finding the structure of a homolog bound to your binding partner or a homolog of your binding partner (as long as it has a similar binding pose). By selecting the bound structure of a homolog as a template, the conformation of the bound state will be used as a starting conformation. So the final predicted model will be biased by the binding mode and will often retain that bound conformation.
Automated Method Ignores Templates for an Important Region of Your Protein
Our automated HM method selects templates that have high levels of similarity and coverage for your sequence. This may lead to structure predictions that fail to use templates that are necessary for proper modeling of an important, but small region of your protein. For example, an active site can have a unique motif found in other templates. But the templates were ignored or weighted very low because the rest of your sequence had many high homology hits. By choosing a template with the correct active site motif, HM will use this as a starting structure. So, you can be certain that this region will be used and the automated detection of other templates should still find good hits for the remainder of your structure.
4a) Swap in Ab Initio or template fragments:
The first thing done during each round of HM Stage One, is the random selection of an Ab Initio or template frament. This fragment is swapped into it’s matching section in the model. In our Frankenstein’s monster analogy, we would randomly choose a body part from our pile of body parts. Note that it is an elbow. Then swap in the new elbow where the monster’s elbow was. In the next step, we will decide if the new elbow is an improvement over the old elbow.
During the early rounds of Stage One, the fragments are poorly fit at the junctions and have some global oddities that will need to be optimized in later stages. By swapping in parts iteratively across the protein (or the monster’s body) we are hopefully finding the best combination of fragments (or body parts). At each round, we sample a new fragments. So, 10,000 fragment samples are swapped in during Stage One. Each round alternates between swapping in a template or an Ab Initio fragment. So 5,000 of each type will be performed.
When Ab Initio fragments are swapped into a region, this is done by replacing the torsion angles for the backbone with that of the fragments torsion angles. When template fragments are swapped, this is done doing a rigid-body alignment of the homolog to the region it is replacing in cartesian space.
In either case, the regions between fragments will have non-native structural abnormalities. In the next section, we describe how we score each fragment insertion, then based on the score we decide whether the fragment swap was an improvement. Thus we slowly build the optimal combination of fragments.
4b) Score the structure:
After a new fragment is swapped into the structure, the whole structure is then scored with three score terms. The sum of those score terms will be compared to the score before the swap in order to determine whether we keep the change.
Term 1 – Rosetta Centroid Energy
This scores backbone atoms and a representative centroid atom for each side chain with a modified Rosetta Energy function. The score only considers how well packed the atoms are in the first iteration, but adds more score terms over time. After many rounds, it adds a score term for pairs of residues that is secondary structure dependant. It also adds a term for buried hydrophobic residues. And eventually adds the remaining Rosetta Energy scoring features. In the Frankenstein analogy, we are just seeing if the body part generally works, not whether its a fully healthy body part in the early stages. We are wearing rose-tinted glasses that make everything look a little better than reality. But later on we measure more levels of health, such as blood flow and range of motion. But we won’t take off the glasses till stage three.
Term 2 – Distance constraints for threaded templates
Up to 30 homologous structures were selected as templates for HM. These all had the input sequence thread onto their structures and were globally aligned to the starting structure. At each round of scoring, the model is compared to all the threaded templates to calculate similarity. This is a way to score how close the HM model is getting to distant homologs. Since not all structures will have full sequence alignment, the score of each region is normalized to the number of templates used for that region. In the Frankenstein analogy, we are measuring how similar this monster variation looks like the individual people used to make up the monster.
Term 3 – Chain break penalty
After each fragment insertion, the gaps between fragments has a strong possibility to end up with gaps. So the peptide bond between fragments will be far from ideal. A chain break penalty will favor fragment insertions that minimize physical gaps between consecutive residues. The chain break penalty start with a weight of 0 and ramps up to 100% by 5,000 iterations. It remains 100% for the remaining 5,000 iterations. In the Frankenstein analogy, we are looking to see whether each body part is stitched together at the ends in a clean way. The parts can be loosely held at first, but after 5,000 swaps we aren’t likely to accept new body swaps that don’t stitch together properly.
4c) Use Metropolis Criteria to accept or reject the changes:
If the new structure has a lower energy than the starting structure for that round, then the change will be accepted. When a change makes the score worse, there is a probability that it will still be accepted. The probability is proportional to how much worse the score gets. This is called the Metropolis Criteria. Occasionally unfavorable fragment insertion allows minor clashes that might be resolved with the right combination of fragments. Thus leading to a better final mix of fragments.
In the Frankenstein analogy, if we swapped in an elbow and the elbow is better, then we are totally keeping the elbow. If it’s worse, we roll a 100 sided die to see if we will keep the bad change. If the elbow is only slightly worse, then we might keep it if we roll over a 70. If it’s a very bad elbow, we only keep it if we roll 100. Occasionally, a slightly bad elbow will be kept. But in future rounds, it turns out the slightly bad elbow was great because it’s the only elbow that hooked onto the best lower arm. So good things occasionally happen with bad fragments.
The best structure moves on:
The cycle of randomly combining fragments is repeated 10,000 times. The structure with the best energy of all sampled combinations then moves on to the second stage. This structure often has distortions at the boundaries where fragment insertions occurred that will by improved with additional modeling. We may have found the ideal set of body parts for our monster by round 7,000, but we automatically continue trying new combinations until round 10,000.
STAGE TWO (Fixing Frankestein’s Mistakes):
While Stage One generated a model with generally correct topology, the details of the structure strongly diverge from a natural protein. Stage two will continue to sample conformational changes is the structure, but will focus on improving the gaps and other distortions in the structure. Stage One is also overly biased by templates. In the Frankenstein analogy, we have a monster with a great combination of body parts that work generally well together, but they are stitched together horribly. Now we need to check our pile of body parts to specifically find parts that fix the regions that have the worst stitching. And we are going to be pickier about our selections. We are properly examining the body for abnormalities and fixing them. However, we are still wearing rose-tinted glasses so we are still kinda easy to please.
5a) Swap in a fragment, preferentially to distorted regions:
Stage two will continue sampling new conformations by randomly swapping out regions with fragments from Ab Intio and homology templates by superposition of the fragments over the region it is replacing. However, we are now analyzing the model in order to find a section of the protein that has bad geometry. We preferentially choose bad sections of the protein and find a fragment that aligns to the sequence in that ugly section. In the Frankenstein analogy, we look over the monster and pick an ugly spot and choose a matching body part and replace it.
If an Ab Initio fragment is used for the swap, it will be aligned to the cognate region of the model by optimizing alignment of the N and C end residues to the model. Since the fragment already has good internal bonds, alignment at the ends will start closer to ideal if the N and C ends are placed well. However, the next step will optimize the conformation further. Ab Initio fragments are swapped for 500 of the 1,500 cycles of Stage Two. In the Frankenstein analogy, a new knee is used to replace the ugly existing knee, but the positioning of the ends match the ugly knee in relation to the upper and lower leg.
On the other hand, if a template fragment is used for the swap, the fragment is aligned to its cognate section by overlay with all residues, not just the ends. Templates are treated differently because they represent an evolutionary relationship, so a global positioning is more likely to be relevant. Template fragments are swapped for 1,000 of the 1,500 cycles of Stage Two. In the Frankenstein analogy, we are choosing body parts for the legs that are all from a bunch of cousins because they all have similar height and go together well. So we swap in a cousin’s knee in an orientation that is most similar to the whole ugly knee. Not just the ends.
5b) Cartesian, gradient-based minimization of the structure:
After the fragment swap, the entire structure will undergo coarse grain optimization using the Rosetta centroid energy function which scores structural details in a more stringent way than Stage One. It examines side chain pairings, the environments for each residue, buried polar residues, and native-like packing in the core. This causes shifting in the backbone and centroid torsion angles and bond lengths to optimize these features. Since all atoms are allowed to shift, there will be improvement within and between fragments.
In the Frankenstein analogy, the monster’s body goes through some healing at the stitches and between regions that are stitched together. His internal repair processes have activated and are restructuring things that its body finds to be abnormal. Though, his repair processes are still a bit confused about the details. So, only big things get fixed.
5c) Score the structure:
After the minimization, the final score of the structure represents how favorable the results are for the swap and optimize steps. The swap may have been so unfavorable that the optimization step did not cause overall improvement.
5d) Use the Metropolis Criteria to accept or reject the changes:
Just like in Stage One, if the changes to the model give the structure a lower energy then the changes will be accepted. This process of sampling new conformations and minimizing the structure will be repeated 1,500 times and only the structure with the lowest energy will continue onwards.
STAGE THREE (Turn the Monster into a Human):
When structures reach the last stage of HM modeling, we are getting close to a real native structure. However, we have done the modeling in centroid mode and not using the full atom energy function. Stage Three is when we add all side chain atoms and use a stringent scoring term. In the Frankenstein analogy, the monster is looking a lot like a real person, but now that we take off the rose colored glasses, we can really see the flaws. It’s time to do precision surgery to turn the monster into a real, normal human being that won’t scare the villagers.
6) Add side chain atoms to the model:
The final structure from Stage Two has all side chains represented with a centroid atom. Each centroid will be removed and replaced with all side chain atoms. The initial position will be a random, but ideal side chain rotamer. Now that this detail is added, a full optimization of the structure can be done. In the Frankenstein analogy, this step isn’t that applicable, but just for fun, let’s dress the monster in a fancy suit and give him a top hat.
7) Optimize the structure with full-atom energy function:
The entire structure now undergoes fine grain optimization with the side chain atoms included. This process uses the Rosetta full-atom energy function and performs the Relax protocol describes here. Again, there is no Frankenstein analogy for this step, but since he’s all dressed up, he should do a nice dance, a fully energetic dance.
For every HM job, the modeling process, stages one through three, will be repeated 200 times so that 200 different predicted structures will be generated for your input sequence. In the Frankenstein analogy, there will be 200 human-like people that have a different combination of body parts that were healed into a very human condition. But all the people will look a little different. If all the jobs converged onto essentially the same answer, the humans will look very similar. If not, you can end up with a strange assortment of folks.
CLUSTERING STRUCTURES (Run once):
To better compare and understand the 200 predicted structures, they are first clustered into groups based on shape similarity. If clusters readily converge to one shape, it often suggests that the process successfully predicted your structure with high resolution. However, if no real homology was present among the templates, there can still be convergence towards a conformation despite high accuracy. A general rule of thumb is that regions with sequence homology over 30% are likely to be accurate. You will probably have a generally good structure if there is 20-30% homology, but have trouble with some details. Under 20%, there will likely be many inaccurate details.
8a) Calculate the backbone RMSD between each of the 200 structures:
Each of the 200 predicted structures are aligned against each other and the overall backbone RMSD is calculated for each pair, as shown in the image above.
8b) Cluster structures into groups based on their RMSD:
A distance matrix for the structures and their RMSD to each other is used to cluster structures into groups based on their RMSD.
8c) The best structure from each of the top five clusters is presented in HM results:
In the results file for your HM run only five structures will be available to view and download. Each structure represents one of the five best clusters of structures. As shown in the image below, the clusters are ranked by their mean structure score. The lowest scoring structure is then selected to represent each one; lowest scoring structures are circled in the image below.