This tutorial has two parts, focusing on two related parts of *ProDy* for studying evolution:

- The sequence sub-package
*Evol*is for fetching, parsing and refining multiple sequence alignments (MSAs), and calculating residue-level properties such as conservation and coevolution as well as sequence-level properties such as percentage identity.

- The signature dynamics module
*SignDy*calculates ENM normal modes for ensembles of related protein structures and evaluates the conservation and differentiation of signature dynamics across families and subfamilies. It also allows classification of ensemble/family members based upon their dynamics, allowing the evolution of protein dynamics to be compared with the evolution of sequence and structure.

We first make the required imports:

In [1]:

```
from prody import *
from pylab import *
%matplotlib inline
confProDy(auto_show=False)
```

@> ProDy is configured: auto_show=False

We also configure *ProDy* to put all the PDB files in a particular folder seeing as there are so many of them.

In [2]:

```
pathPDBFolder('./pdbs/')
```

The protein families database Pfam provides multiple sequence alignments of related protein domains, which we are often used as starting points for sequence evolution analyses. We can fetch such MSAs using the function `fetchPfamMSA`

as follows:

In [53]:

```
filename = fetchPfamMSA('PF00074')
filename
```

@> Pfam MSA for PF00074 is written as PF00074_full.sth.

Out[53]:

'PF00074_full.sth'

We can then parse the MSA into *ProDy* using the `parseMSA`

function, which can handle various types of MSA files including Stockholm, SELEX, CLUSTAL, PIR and FASTA formats.

In [54]:

```
msa = parseMSA(filename)
msa
```

@> 1494 sequence(s) with 348 residues were parsed in 0.01s.

Out[54]:

<MSA: PF00074_full (1494 sequences, 348 residues)>

This alignment can be indexed to extract individual sequences (rows) and residue positions (columns):

In [55]:

```
msa[:10,:10]
```

Out[55]:

<MSA: PF00074_full (10 sequences, 10 residues)>

In [56]:

```
seq0 = msa[0]
seq0
```

Out[56]:

<Sequence: G1ST62_RABIT (PF00074_full[0]; length 348; 119 residues and 229 gaps)>

In [57]:

```
str(seq0)
```

Out[57]:

'..................................TKARWFEIQHIQP.NL.L.Q.---....--C...NR.AM..RG.V.NN......YT.........Q........HC..KP..FNTFL.H.D.........S.F......QD...V............AAV.....C...DF........P.N...V.TC....R........NG..RHNC....HQS....PK..PINMTNCRLT......-AGK..YP....D..CS..Y..S.D.A..T........Q.Y..K.F..FIV..A.CDpp.qkSDPP..YHLVPVHLD..........................'

This alignment contains many redundant sequences as well as lots of rows and columns with large numbers of gaps. Therefore, we refine it using `refineMSA`

, which we can do based on the sequence of RNAS1_BOVIN:

In [58]:

```
msa_refined = refineMSA(msa, label='RNAS1_BOVIN', rowocc=0.8, seqid=0.98)
msa_refined
```

@> Label refinement reduced number of columns from 348 to 119 in 0.00s. @> Row occupancy refinement reduced number of rows from 1494 to 1314 in 0.00s. @> Sequence identity refinement reduced number of rows from 1314 to 1054 in 0.28s.

Out[58]:

<MSA: PF00074_full refined (label=RNAS1_BOVIN, rowocc>=0.8, seqid>=0.98) (1054 sequences, 119 residues)>

We calculate use `calcShannonEntropy`

to calculate the entropy of the refined MSA, which is a measure of sequence variability.

Shannon's entropy measures the degree of uncertainty that exists in a system. In the case of multiple sequence alignments, the Shannon entropy of each protein site (column) can be computed according to:

$$H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^n p_i \log_2 p_i $$where $p_i$ is the frequency of amino acid $i$ in that site. If a column is completely conserved then Shannon entropy is 0. The maximum variability, where each amino acid occurs with frequency 1/20, yields an entropy of 4.32

In [59]:

```
entropy = calcShannonEntropy(msa_refined)
```

We can also show the Shannon entropy on a bar chart:

In [60]:

```
showShannonEntropy(msa_refined);
```

@> Label L5K5X3_PTEAL start-end entry matches length of ungapped sequence. Setting resnums 31 to 149

Next, we obtain residue fluctuations or mobility for a protein member of the above family using the GNM.

We will use chain B of PDB structure 2W5I, which corresponds to our reference sequence RNAS1_BOVIN.

In [61]:

```
ag = parsePDB('2W5I', chain='B')
ag
```

@> Connecting wwPDB FTP server RCSB PDB (USA). @> 2w5i downloaded (2w5i.pdb.gz) @> PDB download via FTP completed (1 downloaded, 0 failed). @> 973 atoms and 1 coordinate set(s) were parsed in 0.05s. @> Secondary structures were assigned to 75 residues.

Out[61]:

<AtomGroup: 2W5IB (973 atoms)>

The next step is to select the corresponding residues from the AtomGroup to match the sequence alignment. We can identify these using `alignSequenceToMSA`

. We give it the Calpha atoms only so the residue numbers aren't repeated.

In [62]:

```
aln, idx_1, idx_2 = alignSequenceToMSA(ag.ca, msa_refined, label='RNAS1_BOVIN')
showAlignment(aln, indices=[idx_1, idx_2])
```

We see that there are extra residues in the PDB sequence compared to the reference sequence so we identify their residue numbers to make a selection.

In [63]:

```
print(ag.ca.getResnums())
```

They are numbered from 1 to 124, two residues are missing from the beginning, and three residues are missing from the end, so we select residues 3 to 121. This now makes the two sequences match.

In [64]:

```
chB = ag.select('resid 3 to 121')
chB
```

Out[64]:

<Selection: 'resid 3 to 121' from 2W5IB (879 atoms)>

In [65]:

```
print(msa_refined['RNAS1_BOVIN'])
print(chB.ca.getSequence())
```

We perform GNM analysis as follows:

In [66]:

```
gnm = GNM('2W5I')
gnm.buildKirchhoff(chB.ca)
gnm.calcModes(n_modes=None) # calculate all modes
```

@> Kirchhoff was built in 0.01s. @> 118 modes were calculated in 0.00s.

We can then visually compare the behaviour at the individual residue level as follows:

In [67]:

```
mobility = calcSqFlucts(gnm)
```

In [68]:

```
figure(figsize=(13,6))
# plot entropy as grey bars
bar(chB.ca.getResnums(), entropy, width=1.2, color='grey', label='entropy');
# rescale mobility
mobility = mobility*(max(entropy)/max(mobility))
# plot mobility as a blue line
showAtomicLines(mobility, atoms=chB.ca, color='b', linewidth=2, label='mobility');
legend()
```

Out[68]:

<matplotlib.legend.Legend at 0x7f42fc6c1940>

In addition to the conservation/variation of individual positions, we can also calculate the coevolution between positions due to correlated mutations.

One simple and common method for this is to compute the mutual information between the columns in the MSA:

In [69]:

```
mutinfo = buildMutinfoMatrix(msa_refined)
```

@> Mutual information matrix was calculated in 0.04s.

In [70]:

```
showMutinfoMatrix(msa_refined, cmap='inferno');
title(None);
```

We can improve this with the widely used average product correction:

In [71]:

```
mi_apc = applyMutinfoCorr(mutinfo)
```

In [72]:

```
showMatrix(mi_apc, cmap='inferno');
```

We can change the colour scale normalisation to eliminate the effect of the diagonal. However, the mutual information matrix is still pretty noisy.

In [73]:

```
showMatrix(mi_apc, cmap='inferno', norm=Normalize(0, 0.5));
```

Therefore, more sophisticated analyses have also been developed including the Direct Information (DI; also known as direct coupling analysis (DCA), which is very successful for **contact prediction**. This method can also be used in *ProDy* as follows:

In [74]:

```
di = buildDirectInfoMatrix(msa_refined)
```

@> DI matrix was calculated in 0.73s.

In [75]:

```
showDirectInfoMatrix(msa_refined, cmap='inferno');
title(None);
```

If we compare the brighter regions on this map to the contact matrix then we see that they indeed match pretty well:

In [76]:

```
showContactMap(gnm, origin='lower', cmap='Greys');
```

We can also apply a rank-ordering to the DI and corrected MI matrix entries, which helps identify the strongest signals:

In [77]:

```
di_rank_row, di_rank_col, di_zscore_sort = calcRankorder(di, zscore=True)
print('row: ', di_rank_row[:5])
print('column:', di_rank_col[:5])
```

@> Zscore normalization has been applied. @> Matrix is symmetric, only lower triangle indices will be returned.

row: [ 79 92 64 69 110] column: [ 45 37 63 62 109]

In [78]:

```
mi_rank_row, mi_rank_col, mi_zscore_sort = calcRankorder(mi_apc, zscore=True)
print('row: ', mi_rank_row[:5])
print(
```