This tutorial has two parts, focusing on two related parts of *ProDy* for studying evolution:

- The sequence sub-package
*Evol*is for fetching, parsing and refining multiple sequence alignments (MSAs), and calculating residue-level properties such as conservation and coevolution as well as sequence-level properties such as percentage identity.

- The signature dynamics module
*SignDy*calculates ENM normal modes for ensembles of related protein structures and evaluates the conservation and differentiation of signature dynamics across families and subfamilies. It also allows classification of ensemble/family members based upon their dynamics, allowing the evolution of protein dynamics to be compared with the evolution of sequence and structure.

We first make the required imports:

In [1]:

```
from prody import *
from pylab import *
%matplotlib inline
confProDy(auto_show=False)
```

We also configure *ProDy* to put all the PDB files in a particular folder seeing as there are so many of them.

In [2]:

```
pathPDBFolder('./pdbs/')
```

The protein families database Pfam provides multiple sequence alignments of related protein domains, which we are often used as starting points for sequence evolution analyses. We can fetch such MSAs using the function `fetchPfamMSA`

as follows:

In [3]:

```
filename = fetchPfamMSA('PF00074')
filename
```

Out[3]:

We can then parse the MSA into *ProDy* using the `parseMSA`

function, which can handle various types of MSA files including Stockholm, SELEX, CLUSTAL, PIR and FASTA formats.

In [4]:

```
msa = parseMSA(filename)
msa
```

Out[4]:

This alignment can be indexed to extract individual sequences (rows) and residue positions (columns):

In [5]:

```
msa[:10,:10]
```

Out[5]:

In [6]:

```
seq0 = msa[0]
seq0
```

Out[6]:

In [7]:

```
str(seq0)
```

Out[7]:

This alignment contains many redundant sequences as well as lots of rows and columns with large numbers of gaps. Therefore, we refine it using `refineMSA`

, which we can do based on the sequence of RNAS1_BOVIN:

In [8]:

```
msa_refined = refineMSA(msa, label='RNAS1_BOVIN', rowocc=0.8, seqid=0.98)
msa_refined
```

Out[8]:

We calculate use `calcShannonEntropy`

to calculate the entropy of the refined MSA, which is a measure of sequence variability.

Shannon's entropy measures the degree of uncertainty that exists in a system. In the case of multiple sequence alignments, the Shannon entropy of each protein site (column) can be computed according to:

$$H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^n p_i \log_2 p_i $$where $p_i$ is the frequency of amino acid $i$ in that site. If a column is completely conserved then Shannon entropy is 0. The maximum variability, where each amino acid occurs with frequency 1/20, yields an entropy of 4.32

In [9]:

```
entropy = calcShannonEntropy(msa_refined)
```

We can also show the Shannon entropy on a bar chart:

In [10]:

```
showShannonEntropy(msa_refined);
```

Next, we obtain residue fluctuations or mobility for a protein member of the above family using the GNM.

We will use chain B of PDB structure 2W5I, which corresponds to our reference sequence RNAS1_BOVIN.

In [11]:

```
ag = parsePDB('2W5I', chain='B')
ag
```

Out[11]:

The next step is to select the corresponding residues from the AtomGroup to match the sequence alignment. We can identify these using `alignSequenceToMSA`

. We give it the Calpha atoms only so the residue numbers aren't repeated.

In [12]:

```
aln, idx_1, idx_2 = alignSequenceToMSA(ag.ca, msa_refined, label='RNAS1_BOVIN')
showAlignment(aln, indices=[idx_1, idx_2])
```

We see that there are extra residues in the PDB sequence compared to the reference sequence so we identify their residue numbers to make a selection.

In [13]:

```
print(ag.ca.getResnums())
```

They are numbered from 1 to 124, two residues are missing from the beginning, and three residues are missing from the end, so we select residues 3 to 121. This now makes the two sequences match.

In [14]:

```
chB = ag.select('resid 3 to 121')
chB
```

Out[14]:

In [15]:

```
print(msa_refined['RNAS1_BOVIN'])
print(chB.ca.getSequence())
```

We perform GNM analysis as follows:

In [16]:

```
gnm = GNM('2W5I')
gnm.buildKirchhoff(chB.ca)
gnm.calcModes(n_modes=None) # calculate all modes
```

We can then visually compare the behaviour at the individual residue level as follows:

In [17]:

```
mobility = calcSqFlucts(gnm)
```

In [18]:

```
figure(figsize=(13,6))
# plot entropy as grey bars
bar(chB.ca.getResnums(), entropy, width=1.2, color='grey', label='entropy');
# rescale mobility
mobility = mobility*(max(entropy)/max(mobility))
# plot mobility as a blue line
showAtomicLines(mobility, atoms=chB.ca, color='b', linewidth=2, label='mobility');
legend()
```

Out[18]:

In addition to the conservation/variation of individual positions, we can also calculate the coevolution between positions due to correlated mutations.

One simple and common method for this is to compute the mutual information between the columns in the MSA:

In [19]:

```
mutinfo = buildMutinfoMatrix(msa_refined)
```

In [20]:

```
showMutinfoMatrix(msa_refined, cmap='inferno');
title(None);
```

We can improve this with the widely used average product correction:

In [21]:

```
mi_apc = applyMutinfoCorr(mutinfo)
```

In [22]:

```
showMatrix(mi_apc, cmap='inferno');
```

We can change the colour scale normalisation to eliminate the effect of the diagonal. However, the mutual information matrix is still pretty noisy.

In [23]:

```
showMatrix(mi_apc, cmap='inferno', norm=Normalize(0, 0.5));
```

Therefore, more sophisticated analyses have also been developed including the Direct Information (DI; also known as direct coupling analysis (DCA), which is very successful for **contact prediction**. This method can also be used in *ProDy* as follows:

In [24]:

```
di = buildDirectInfoMatrix(msa_refined)
```

In [25]:

```
showDirectInfoMatrix(msa_refined, cmap='inferno');
title(None);
```

If we compare the brighter regions on this map to the contact matrix then we see that they indeed match pretty well:

In [26]:

```
showContactMap(gnm, origin='lower', cmap='Greys');
```

We can also apply a rank-ordering to the DI and corrected MI matrix entries, which helps identify the strongest signals:

In [27]:

```
di_rank_row, di_rank_col, di_zscore_sort = calcRankorder(di, zscore=True)
print('row: ', di_rank_row[:5])
print('column:', di_rank_col[:5])
```

In [28]:

```
mi_rank_row, mi_rank_col, mi_zscore_sort = calcRankorder(mi_apc, zscore=True)
print('row: ', mi_rank_row[:5])
print('column:', mi_rank_col[:5])
```

This tutorial describes how to calculate signature dynamics for a family of proteins with similar structures using Elastic Network Models (ENMs). This method (also called ensemble normal mode analysis) creates an ensemble of aligned structures and calculates statistics such as means and standard deviations on various dynamic properties including mode profiles, mean square fluctuations and cross-correlation matrices. It also includes tools for classifying family members based on their sequence, structure and dynamics.

The theory and usage of this toolkit is described in our recent paper:

Zhang S, Li H, Krieger J, Bahar I.
Shared signature dynamics tempered by local fluctuations enables fold adaptability and specificity. *Mol. Biol. Evol.* **2019** 36(9):2053â€“2068

In this tutorial, we will have a quick walk-through on the *SignDy* calculations and functions using the example of type-I
periplasmic binding protein (PBP-I) domains. The data is collected using the Dali server (http://ekhidna2.biocenter.helsinki.fi/dali/).

Holm L, RosenstrÃ¶m P.
Dali server: conservation mapping in 3D.
*Nucleic Acids Res.* **2010** 10(38):W545-9

In addition to the previous imports, we also import time so that we can use the `sleep`

function to reduce the load on the Dali server.

In [29]:

```
import time
```

The first step in signature dynamics analysis is to collect a set of related protein structures and build a `PDBEnsemble`

. This can be achieved by multiple routes: a query search of the PDB using `blastPDB`

or Dali, extraction of PDB IDs from the Pfam database (as above) or the CATH database, or input of a pre-defined list.

We demonstrate the Dali method here in the first part of the tutorial. The usage of CATH methods is described in the website tutorial and the function `blastPDB`

is described in the Structure Analysis Tutorial.

We apply these methods to the PBP-I domains, a group of protein structures originally found in bacteria for transport of solutes across the periplasmic space and later seen in various eukaryotic receptors including ionotropic and metabotropic glutamate receptors. We use the N-terminal domain of AMPA receptor subunit GluA2 (gene name GRIA2; https://www.uniprot.org/uniprot/P42262) as a query.

The second step is then to calculate ENM normal modes for all members of the `PDBEnsemble`

, creating a `ModeEnsemble`

. We usually use the GNM for this as will be shown here, but the ANM can be used too.

The third step is then to analyse conserved and divergent behaviours to identify signature dynamics of the whole family or individual subfamilies. This is aided calculations of overlaps and distances between the mode spectra (step 4), which can be used to create phylogenetic trees that can be compared to sequence and structural conservation and divergence.

First we use the function `searchDali`

to search the PDB with Dali, which returns a `DaliRecord`

object that contains a list of PDB IDs and their corresponding mappings to the reference structure.

In [30]:

```
dali_rec = searchDali('3H5V','A')
dali_rec
```

Out[30]:

The Dali search often remains in the queue longer than the timeout time. We therefore have a `fetch`

method, which can be run later to fetch the data. We can run this in a loop with a wait of a couple of minutes in between fetches to make sure we get the result.

In [31]:

```
while not dali_rec.isSuccess:
dali_rec.fetch()
time.sleep(120)
dali_rec
```

Out[31]:

Next, we get the lists of PDB IDs and mappings from *dali_rec*, and parse the *pdb_ids* to get a list of `AtomGroup`

instances:

In [32]:

```
pdb_ids = dali_rec.filter(cutoff_len=0.7, cutoff_rmsd=1.0, cutoff_Z=30)
```

In [33]:

```
mappings = dali_rec.getMappings()
```

In [34]:

```
ags = parsePDB(pdb_ids, subset='ca')
len(ags)
```

Out[34]:

Then we provide *ags* together with *mappings* to `buildPDBEnsemble`

. We set the keyword argument `seqid=20`

to account for the low sequence identity between some of the structures.

In [35]:

```
dali_ens = buildPDBEnsemble(ags, mapping=mappings, seqid=20, labels=pdb_ids)
dali_ens
```

Out[35]:

Finally, we save the ensemble for later processing:

In [36]:

```
saveEnsemble(dali_ens, 'PBP-I')
```

Out[36]:

For this analysis we'll build a `ModeEnsemble`

by calculating normal modes for each member of the `PDBEnsemble`

.

You can load a PDB ensemble at this stage if you already have one. We demonstrate this for the one we just saved.

In [37]:

```
dali_ens = loadEnsemble('PBP-I.ens.npz')
```

Then we calculate `GNM`

modes for each member of the ensemble using `calcEnsembleENMs`

. There are options to select the *model* (`GNM`

by default) and the way of considering non-aligned residues by setting the *trim* option (default is `reduceModel`

, which treats them as environment).

In [38]:

```
gnms = calcEnsembleENMs(dali_ens, model='GNM', trim='reduce')
gnms
```

Out[38]:

We can save the mode ensemble as follows:

In [39]:

```
saveModeEnsemble(gnms, 'PBP-I')
```

Out[39]:

We can also load in a previously saved mode ensemble such as the one we saved above:

In [40]:

```
gnms = loadModeEnsemble('PBP-I.modeens.npz')
```

We can index the `ModeEnsemble`

object in two different dimensions. The first dimension corresponds to ensemble members as shown below for extracting the mode set for the first member (numbered 0).

In [41]:

```
gnms[0]
```

Out[41]:

The second dimension corresponds to particular modes of all ensemble members as shown below for extracting the first mode (numbered 0). The colon means we select everything from the first dimension.

In [42]:

```
gnms[:,0]
```

Out[42]:

We can also slice out ranges of members and modes and index them both at the same time. E.g. to get the five members from 5 up to but not including 10 (5, 6, 7, 8, 9), and the two modes from 2 up to but not including 4 (modes with indices 2 and 3 in the reference), we'd use the following code.

In [43]:

```
gnms[5:10,2:4]
```

Out[43]:

We can also use indexing to extract individual modes from individual members, e.g.

In [44]:

```
gnms[5,2]
```

Out[44]:

Remember that we usually talk about modes counting from 1 so this is "Mode 3" or "the 3rd global mode" in conversation but Python counts from 0 so it has index 2. Likewise this is the "6th member" of the ensemble but has index 5.

Signatures are calculated as the mean and standard deviation of various properties such as mode shapes and mean square fluctations.

For example, we can show the average and standard deviation of the shape of the first mode (second index 0). The first index of the mode ensemble is over conformations.

In [45]:

```
showSignatureMode(gnms[:, 0]);
```

We can also show such results for properties involving multiple modes such as the mean square fluctuations from the first 5 modes or the cross-correlations from the first 20.

In [46]:

```
showSignatureSqFlucts(gnms[:, :5]);
```

In [47]:

```
showSignatureCrossCorr(gnms[:, :20]);
```

We can also look at distributions over values across different members of the ensemble such as inverse eigenvalue. We can show a bar above this with individual members labelled like in

Krieger J, Bahar I, Greger IH.
Structure, Dynamics, and Allosteric Potential of Ionotropic Glutamate Receptor N-Terminal Domains. *Biophys. J.* **2015** 109(6):1136-48.

In this automated version, the bar is coloured from white to dark red depending on how many structures have values at that point.

We can select particular members to highlight with arrows by putting their names and labels in a dictionary:

In [48]:

```
highlights = {'3h5vA': 'GluA2','3o21C': 'GluA3',
'3h6gA': 'GluK2', '3olzA': 'GluK3',
'5kc8A': 'GluD2'}
```

We plot the variance bar for the first five modes (showing a function of the inverse eigenvalues related to the resultant relative size of motion) above the inverse eigenvalue distributions for each of those modes. To arrange the plots like this, we use the `GridSpec`

function of Matplotlib.

In [49]:

```
gs = GridSpec(ncols=1, nrows=2, height_ratios=[1, 10], hspace=0.15)
subplot(gs[0]);
showVarianceBar(gnms[:, :5], fraction=True, highlights=highlights);
xlabel('');
subplot(gs[1]);
showSignatureVariances(gnms[:, :5], fraction=True, bins=80, alpha=0.7);
xlabel('Fraction of inverse eigenvalue');
```

We can also extract the eigenvalues and eigenvectors directly from the mode ensemble and analyse them ourselves:

In [50]:

```
eigvals = gnms.getEigvals()
eigvals
```

Out[50]:

In [51]:

```
eigvecs = gnms.getEigvecs()
eigvecs
```

Out[51]:

These are stored in instances of the `sdarray`

class that we designed specifically for signature dynamics analysis. It is an extension of the standard NumPy `ndarray`

but has additional attributes and some modified methods. The first axis is reserved for ensemble members and the mean, min, max and std are altered to average over this dimension rather than all dimensions.

We can look at the shape of these arrays and index them just like `ndarray`

and `ModeEnsemble`

objects. The eigenvalues are arranged in **eigvals** such that the first axis is the members and the second is the modes as in the mode ensemble.

In [52]:

```
eigvals.shape
```

Out[52]:

In [53]:

```
eigvals[0:5,0:5]
```

Out[53]:

The eigenvectors are arranged in **eigvecs** such that the first axis is over the members, and the remaining dimensions are as in other eigenvector arrays - the second is over atoms and the third is mode index. Each atom has a weight, which varies between members and is important in calculating the mean, std, etc.

In [54]:

```
eigvecs.shape
```

Out[54]:

Spectral overlap, also known as covariance overlap, measures the overlap between two covariance matrices, or the overlap of a subset of the modes (a mode spectrum). This can also be converted into a distance using its arccosine as will be shown below.

We can calculate a matrix of spectral overlaps (**so_matrix**) over any slice of the ModeEnsemble that is still a mode ensemble itself, e.g.

In [55]:

```
so_matrix = calcEnsembleSpectralOverlaps(gnms[:, :1])
```

In [56]:

```
figure(figsize=(8,8))
showMatrix(so_matrix);
```