Evolution of sequence, structure and dynamics with Evol and SignDy

This tutorial has two parts, focusing on two related parts of ProDy for studying evolution:

  1. The sequence sub-package Evol is for fetching, parsing and refining multiple sequence alignments (MSAs), and calculating residue-level properties such as conservation and coevolution as well as sequence-level properties such as percentage identity.
  1. The signature dynamics module SignDy calculates ENM normal modes for ensembles of related protein structures and evaluates the conservation and differentiation of signature dynamics across families and subfamilies. It also allows classification of ensemble/family members based upon their dynamics, allowing the evolution of protein dynamics to be compared with the evolution of sequence and structure.

We first make the required imports:

In [1]:
from prody import *
from pylab import *
%matplotlib inline
confProDy(auto_show=False)
@> ProDy is configured: auto_show=False

We also configure ProDy to put all the PDB files in a particular folder seeing as there are so many of them.

In [2]:
pathPDBFolder('./pdbs/')
@> Local PDB folder is set: 'C:\\Users\\burak\\Desktop\\wormat\\pdbs'
@> A plain folder structure will be assumed.

1. Sequence evolution with Evol

Fetching, parsing and refining MSAs from Pfam

The protein families database Pfam provides multiple sequence alignments of related protein domains, which we are often used as starting points for sequence evolution analyses. We can fetch such MSAs using the function fetchPfamMSA as follows:

In [3]:
filename = fetchPfamMSA('PF00074')
filename
@> Pfam MSA for PF00074 is written as .\PF00074_full.sth.
Out[3]:
'.\\PF00074_full.sth'

We can then parse the MSA into ProDy using the parseMSA function, which can handle various types of MSA files including Stockholm, SELEX, CLUSTAL, PIR and FASTA formats.

In [4]:
msa = parseMSA(filename)
msa
@> 1187 sequence(s) with 316 residues were parsed in 0.19s.
Out[4]:
<MSA: PF00074_full (1187 sequences, 316 residues)>

This alignment can be indexed to extract individual sequences (rows) and residue positions (columns):

In [5]:
msa[:10,:10]
Out[5]:
<MSA: PF00074_full (10 sequences, 10 residues)>
In [6]:
seq0 = msa[0]
seq0
Out[6]:
<Sequence: G1ST62_RABIT (PF00074_full[0]; length 316; 119 residues and 197 gaps)>
In [7]:
str(seq0)
Out[7]:
'............................TKARWFEIQHIQP.NL.L.Q.---....--C...NRAM.RG.......V.NN......YT.........Q......HC.K..PFNTFL......H.D.........S.F......QD...V.AAV.....C...DF........P.N...V.T.C..R........NG..RHNC....HQS....PKPINMTNCRLT......-AGK..YPD.CSY.S.D.A.....T.Q.Y.K.F.FIV..ACDpp.qkSDPP..YHLVPVHLD.......................'

This alignment contains many redundant sequences as well as lots of rows and columns with large numbers of gaps. Therefore, we refine it using refineMSA, which we can do based on the sequence of RNAS1_BOVIN:

In [8]:
msa_refined = refineMSA(msa, label='RNAS1_BOVIN', rowocc=0.8, seqid=0.98)
msa_refined
@> Label refinement reduced number of columns from 316 to 119 in 0.00s.
@> Row occupancy refinement reduced number of rows from 1187 to 1042 in 0.01s.
@> Sequence identity refinement reduced number of rows from 1042 to 857 in 0.23s.
Out[8]:
<MSA: PF00074_full refined (label=RNAS1_BOVIN, rowocc>=0.8, seqid>=0.98) (857 sequences, 119 residues)>

Measuring sequence conservation with Shannon entropy

We calculate use calcShannonEntropy to calculate the entropy of the refined MSA, which is a measure of sequence variability.

Shannon's entropy measures the degree of uncertainty that exists in a system. In the case of multiple sequence alignments, the Shannon entropy of each protein site (column) can be computed according to:

$$H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^n p_i \log_2 p_i $$

where $p_i$ is the frequency of amino acid $i$ in that site. If a column is completely conserved then Shannon entropy is 0. The maximum variability, where each amino acid occurs with frequency 1/20, yields an entropy of 4.32

In [9]:
entropy = calcShannonEntropy(msa_refined)

We can also show the Shannon entropy on a bar chart:

In [10]:
showShannonEntropy(msa_refined);

Comparisons of sequence evolution and structural dynamics

Next, we obtain residue fluctuations or mobility for a protein member of the above family using the GNM.

We will use chain B of PDB structure 2W5I, which corresponds to our reference sequence RNAS1_BOVIN.

In [11]:
ag = parsePDB('2W5I', chain='B')
ag
@> Connecting wwPDB FTP server RCSB PDB (USA).
@> 2w5i downloaded (C:\Users\burak\...\2w5i.pdb.gz)
@> PDB download via FTP completed (1 downloaded, 0 failed).
@> 973 atoms and 1 coordinate set(s) were parsed in 0.03s.
@> Secondary structures were assigned to 75 residues.
Out[11]:
<AtomGroup: 2W5IB (973 atoms)>

The next step is to select the corresponding residues from the AtomGroup to match the sequence alignment. We can identify these using alignSequenceToMSA. We give it the Calpha atoms only so the residue numbers aren't repeated.

In [12]:
aln, idx_1, idx_2 = alignSequenceToMSA(ag.ca, msa_refined, label='RNAS1_BOVIN')
showAlignment(aln, indices=[idx_1, idx_2])
               	                  20        30        40        50        60
2W5IB          	KETAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQ

               	                  18        28        38        48        58
RNAS1_BOVIN    	--TAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQ


               	        70        80        90       100       110       120
2W5IB          	KNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHF

               	        68        78        88        98       108       118
RNAS1_BOVIN    	KNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHF


               	                                                            
2W5IB          	DASV

               	                                                            
RNAS1_BOVIN    	D---

We see that there are extra residues in the PDB sequence compared to the reference sequence so we identify their residue numbers to make a selection.

In [13]:
print(ag.ca.getResnums())
[  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124]

They are numbered from 1 to 124, two residues are missing from the beginning, and three residues are missing from the end, so we select residues 3 to 121. This now makes the two sequences match.

In [14]:
chB = ag.select('resid 3 to 121')
chB
Out[14]:
<Selection: 'resid 3 to 121' from 2W5IB (879 atoms)>
In [15]:
print(msa_refined['RNAS1_BOVIN'])
print(chB.ca.getSequence())
tAAAKFERQHMDSSTSAASSsNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHFD
TAAAKFERQHMDSSTSAASSSNYCNQMMKSRNLTKDRCKPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQANKHIIVACEGNPYVPVHFD

We perform GNM analysis as follows:

In [16]:
gnm = GNM('2W5I')
gnm.buildKirchhoff(chB.ca)
gnm.calcModes(n_modes=None)  # calculate all modes
@> Kirchhoff was built in 0.01s.
@> 118 modes were calculated in 0.02s.

We can then visually compare the behaviour at the individual residue level as follows:

In [17]:
mobility = calcSqFlucts(gnm)
In [18]:
figure(figsize=(13,6))

# plot entropy as grey bars
bar(chB.ca.getResnums(), entropy, width=1.2, color='grey', label='entropy');

# rescale mobility
mobility = mobility*(max(entropy)/max(mobility))

# plot mobility as a blue line
showAtomicLines(mobility, atoms=chB.ca, color='b', linewidth=2, label='mobility');

legend()
Out[18]:
<matplotlib.legend.Legend at 0x1ea9f1b8790>

Coevolution Calculation

In addition to the conservation/variation of individual positions, we can also calculate the coevolution between positions due to correlated mutations.

One simple and common method for this is to compute the mutual information between the columns in the MSA:

In [19]:
mutinfo = buildMutinfoMatrix(msa_refined)
@> Mutual information matrix was calculated in 0.07s.
In [20]:
showMutinfoMatrix(msa_refined, cmap='inferno');
title(None);
@> Mutual information matrix was calculated in 0.07s.

We can improve this with the widely used average product correction:

In [21]:
mi_apc = applyMutinfoCorr(mutinfo)
In [22]:
showMatrix(mi_apc, cmap='inferno');

We can change the colour scale normalisation to eliminate the effect of the diagonal. However, the mutual information matrix is still pretty noisy.

In [23]:
showMatrix(mi_apc, cmap='inferno', norm=Normalize(0, 0.5));

Therefore, more sophisticated analyses have also been developed including the Direct Information (DI; also known as direct coupling analysis (DCA), which is very successful for contact prediction. This method can also be used in ProDy as follows:

In [24]:
di = buildDirectInfoMatrix(msa_refined)
@> DI matrix was calculated in 2.14s.
In [25]:
showDirectInfoMatrix(msa_refined, cmap='inferno');
title(None);
@> DI matrix was calculated in 4.73s.

If we compare the brighter regions on this map to the contact matrix then we see that they indeed match pretty well:

In [26]:
showContactMap(gnm, origin='lower', cmap='Greys');

We can also apply a rank-ordering to the DI and corrected MI matrix entries, which helps identify the strongest signals:

In [27]:
di_rank_row, di_rank_col, di_zscore_sort = calcRankorder(di, zscore=True)
print('row:   ', di_rank_row[:5])
print('column:', di_rank_col[:5])
@> Zscore normalization has been applied.
@> Matrix is symmetric, only lower triangle indices will be returned.
row:    [79 92 64 69  2]
column: [45 37 63 62  1]
In [28]:
mi_rank_row, mi_rank_col, mi_zscore_sort = calcRankorder(mi_apc, zscore=True)
print('row:   ', mi_rank_row[:5])
print('column:', mi_rank_col[:5])
@> Zscore normalization has been applied.
@> Matrix is symmetric, only lower triangle indices will be returned.
row:    [ 79  92 115  69 110]
column: [ 45  37 114  62 109]

2. Signature Dynamics analysis with SignDy

This tutorial describes how to calculate signature dynamics for a family of proteins with similar structures using Elastic Network Models (ENMs). This method (also called ensemble normal mode analysis) creates an ensemble of aligned structures and calculates statistics such as means and standard deviations on various dynamic properties including mode profiles, mean square fluctuations and cross-correlation matrices. It also includes tools for classifying family members based on their sequence, structure and dynamics.

The theory and usage of this toolkit is described in our recent paper:

Zhang S, Li H, Krieger J, Bahar I. Shared signature dynamics tempered by local fluctuations enables fold adaptability and specificity. Mol. Biol. Evol. 2019 36(9):2053–2068

In this tutorial, we will have a quick walk-through on the SignDy calculations and functions using the example of type-I periplasmic binding protein (PBP-I) domains. The data is collected using the Dali server (http://ekhidna2.biocenter.helsinki.fi/dali/).

Holm L, Rosenström P. Dali server: conservation mapping in 3D. Nucleic Acids Res. 2010 10(38):W545-9

In addition to the previous imports, we also import time so that we can use the sleep function to reduce the load on the Dali server.

In [29]:
import time

Overview

The first step in signature dynamics analysis is to collect a set of related protein structures and build a PDBEnsemble. This can be achieved by multiple routes: a query search of the PDB using blastPDB or Dali, extraction of PDB IDs from the Pfam database (as above) or the CATH database, or input of a pre-defined list.

We demonstrate the Dali method here in the first part of the tutorial. The usage of CATH methods is described in the website tutorial and the function blastPDB is described in the Structure Analysis Tutorial.

We apply these methods to the PBP-I domains, a group of protein structures originally found in bacteria for transport of solutes across the periplasmic space and later seen in various eukaryotic receptors including ionotropic and metabotropic glutamate receptors. We use the N-terminal domain of AMPA receptor subunit GluA2 (gene name GRIA2; https://www.uniprot.org/uniprot/P42262) as a query.

The second step is then to calculate ENM normal modes for all members of the PDBEnsemble, creating a ModeEnsemble. We usually use the GNM for this as will be shown here, but the ANM can be used too.

The third step is then to analyse conserved and divergent behaviours to identify signature dynamics of the whole family or individual subfamilies. This is aided calculations of overlaps and distances between the mode spectra (step 4), which can be used to create phylogenetic trees that can be compared to sequence and structural conservation and divergence.

Step 1: Prepare Ensemble (using Dali)

First we use the function searchDali to search the PDB with Dali, which returns a DaliRecord object that contains a list of PDB IDs and their corresponding mappings to the reference structure.

In [30]:
dali_rec = searchDali('3H5V','A')
dali_rec
@> Submitted Dali search for PDB "3H5VA".
@> http://ekhidna2.biocenter.helsinki.fi/barcosel/tmp//3H5VA/
@> Dali results were fetched in 0.3s.   
@> Obtained 3595 PDB chains from Dali for 3H5VA.
Out[30]:
<prody.database.dali.DaliRecord at 0x1ea9fcd2370>

The Dali search often remains in the queue longer than the timeout time. We therefore have a fetch method, which can be run later to fetch the data. We can run this in a loop with a wait of a couple of minutes in between fetches to make sure we get the result.

In [31]:
while not dali_rec.isSuccess:
    dali_rec.fetch()
    time.sleep(120)
    
dali_rec
Out[31]:
<prody.database.dali.DaliRecord at 0x1ea9fcd2370>

Next, we get the lists of PDB IDs and mappings from dali_rec, and parse the pdb_ids to get a list of AtomGroup instances:

In [32]:
pdb_ids = dali_rec.filter(cutoff_len=0.7, cutoff_rmsd=1.0, cutoff_Z=30)
@> 3446 PDBs have been filtered out from 3595 Dali hits (remaining: 149).
In [33]:
mappings = dali_rec.getMappings()
In [34]:
ags = parsePDB(pdb_ids, subset='ca')
len(ags)
@> 149 PDBs were parsed in 63.57s.
Out[34]:
149

Then we provide ags together with mappings to buildPDBEnsemble. We set the keyword argument seqid=20 to account for the low sequence identity between some of the structures.

In [35]:
dali_ens = buildPDBEnsemble(ags, mapping=mappings, seqid=20, labels=pdb_ids)
dali_ens
@> Mapping 5weoB_ca to the reference... [  2%]@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5welB_ca to the reference... [  2%]@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm1D_ca to the reference... [  3%] 16s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm2B_ca to the reference... [  4%] 16s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dlzD_ca to the reference... [  5%] 14s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5weoD_ca to the reference... [  6%] 14s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6qkzC_ca to the reference... [  6%] 14s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5l1eB_ca to the reference... [  8%] 13s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6njnC_ca to the reference... [  8%] 13s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5l1fB_ca to the reference... [  9%] 13s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6o9gA_ca to the reference... [ 11%] 12s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6njnA_ca to the reference... [ 13%] 11s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6kzmA_ca to the reference... [ 14%] 11s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6l6fB_ca to the reference... [ 14%] 11s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6njlC_ca to the reference... [ 16%] 11s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5weoA_ca to the reference... [ 16%] 11s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5welC_ca to the reference... [ 17%] 11s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6njmC_ca to the reference... [ 18%] 11s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 4uqqD_ca to the reference... [ 18%] 10s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm0C_ca to the reference... [ 19%] 10s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5welD_ca to the reference... [ 20%] 10s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6njmA_ca to the reference... [ 21%] 10s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5kuhD_ca to the reference... [ 24%] 10s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 4uqqB_ca to the reference... [ 26%] 9s @> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6jfzA_ca to the reference... [ 26%] 9s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5l1hB_ca to the reference... [ 27%] 9s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5wenC_ca to the reference... [ 29%] 9s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6lu9D_ca to the reference... [ 30%] 9s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm1C_ca to the reference... [ 32%] 8s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dlzA_ca to the reference... [ 34%] 8s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5kc9A_ca to the reference... [ 34%] 8s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5ideD_ca to the reference... [ 35%] 8s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm0B_ca to the reference... [ 36%] 8s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6kzmC_ca to the reference... [ 38%] 8s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5kc9B_ca to the reference... [ 40%] 7s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 4u1xB_ca to the reference... [ 41%] 7s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6o9gD_ca to the reference... [ 43%] 7s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm2A_ca to the reference... [ 44%] 7s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5l1gB_ca to the reference... [ 44%] 7s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5welA_ca to the reference... [ 46%] 7s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6qkzA_ca to the reference... [ 47%] 7s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 4u1xA_ca to the reference... [ 49%] 6s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6lu9A_ca to the reference... [ 51%] 6s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5idfD_ca to the reference... [ 51%] 6s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6o9gC_ca to the reference... [ 52%] 6s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5kufA_ca to the reference... [ 53%] 6s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm1A_ca to the reference... [ 55%] 6s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm1B_ca to the reference... [ 55%] 6s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6jfyD_ca to the reference... [ 57%] 5s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 4u1yA_ca to the reference... [ 58%] 5s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6jfzD_ca to the reference... [ 60%] 5s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm0D_ca to the reference... [ 61%] 5s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 4u1wD_ca to the reference... [ 62%] 5s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm0A_ca to the reference... [ 63%] 5s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dlzB_ca to the reference... [ 65%] 5s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6l6fA_ca to the reference... [ 67%] 4s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6jfyA_ca to the reference... [ 69%] 4s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6jfzB_ca to the reference... [ 70%] 4s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5wenB_ca to the reference... [ 73%] 4s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 4u1xD_ca to the reference... [ 73%] 4s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 4u2pA_ca to the reference... [ 74%] 3s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5weoC_ca to the reference... [ 75%] 3s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm2D_ca to the reference... [ 78%] 3s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dm2C_ca to the reference... [ 79%] 3s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 3td9A_ca to the reference... [ 80%] 3s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6dlzC_ca to the reference... [ 81%] 3s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6njlA_ca to the reference... [ 83%] 2s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6o9gB_ca to the reference... [ 85%] 2s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6ruqD_ca to the reference... [ 85%] 2s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5l2eA_ca to the reference... [ 86%] 2s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5ideB_ca to the reference... [ 87%] 2s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6kzmB_ca to the reference... [ 87%] 2s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5wenA_ca to the reference... [ 88%] 2s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5idfB_ca to the reference... [ 89%] 2s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5kufC_ca to the reference... [ 94%] 1s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 4u5cB_ca to the reference... [ 95%] 1s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 6l6fC_ca to the reference... [ 97%] 1s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5l2eB_ca to the reference... [ 98%] 1s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Mapping 5wenD_ca to the reference... [ 99%] 1s@> WARNING no atommaps were available. Consider adjusting accepting criteria
@> Starting iterative superposition:             
@> Step #1: RMSD difference = 1.2488e+00
@> Step #2: RMSD difference = 1.1085e-02
@> Step #3: RMSD difference = 2.6526e-04
@> Step #4: RMSD difference = 7.9073e-06
@> Iterative superposition completed in 0.14s.
@> Final superposition to calculate transformations.
@> Superposition completed in 0.02 seconds.
@> Ensemble (70 conformations) were built in 11.53s.
@> WARNING 79 structures cannot be mapped.
Out[35]:
<PDBEnsemble: Unknown (70 conformations; 376 atoms)>

Finally, we save the ensemble for later processing:

In [36]:
saveEnsemble(dali_ens, 'PBP-I')
Out[36]:
'PBP-I.ens.npz'

Step 2: Mode ensemble

For this analysis we'll build a ModeEnsemble by calculating normal modes for each member of the PDBEnsemble.

You can load a PDB ensemble at this stage if you already have one. We demonstrate this for the one we just saved.

In [37]:
dali_ens = loadEnsemble('PBP-I.ens.npz')

Then we calculate GNM modes for each member of the ensemble using calcEnsembleENMs. There are options to select the model (GNM by default) and the way of considering non-aligned residues by setting the trim option (default is reduceModel, which treats them as environment).

In [38]:
gnms = calcEnsembleENMs(dali_ens, model='GNM', trim='reduce')
gnms
@> 20 GNM modes were calculated for each of the 70 conformations in 4.37s.
@> 20 modes across 70 modesets were matched in 0.50s.
Out[38]:
<ModeEnsemble: 70 modesets (20 modes, 376 atoms)>

We can save the mode ensemble as follows:

In [39]:
saveModeEnsemble(gnms, 'PBP-I')
Out[39]:
'PBP-I.modeens.npz'

We can also load in a previously saved mode ensemble such as the one we saved above:

In [40]:
gnms = loadModeEnsemble('PBP-I.modeens.npz')

Slicing and Indexing Mode Ensembles

We can index the ModeEnsemble object in two different dimensions. The first dimension corresponds to ensemble members as shown below for extracting the mode set for the first member (numbered 0).

In [41]:
gnms[0]
Out[41]:
<ModeSet: 20 modes from MaskedGNM 3h5vA reduced>

The second dimension corresponds to particular modes of all ensemble members as shown below for extracting the first mode (numbered 0). The colon means we select everything from the first dimension.

In [42]:
gnms[:,0]
Out[42]:
<ModeEnsemble: 70 modesets (1 mode, 376 atoms)>

We can also slice out ranges of members and modes and index them both at the same time. E.g. to get the five members from 5 up to but not including 10 (5, 6, 7, 8, 9), and the two modes from 2 up to but not including 4 (modes with indices 2 and 3 in the reference), we'd use the following code.

In [43]:
gnms[5:10,2:4]
Out[43]:
<ModeEnsemble: 5 modesets (2 modes, 376 atoms)>

We can also use indexing to extract individual modes from individual members, e.g.

In [44]:
gnms[5,2]
Out[44]:
<Mode: 6 from MaskedGNM 3qluA reduced>

Remember that we usually talk about modes counting from 1 so this is "Mode 3" or "the 3rd global mode" in conversation but Python counts from 0 so it has index 2. Likewise this is the "6th member" of the ensemble but has index 5.

Step 3: Signature dynamics

Signatures are calculated as the mean and standard deviation of various properties such as mode shapes and mean square fluctations.

For example, we can show the average and standard deviation of the shape of the first mode (second index 0). The first index of the mode ensemble is over conformations.

In [45]:
showSignatureMode(gnms[:, 0]);

We can also show such results for properties involving multiple modes such as the mean square fluctuations from the first 5 modes or the cross-correlations from the first 20.

In [46]:
showSignatureSqFlucts(gnms[:, :5]);
In [47]:
showSignatureCrossCorr(gnms[:, :20]);

We can also look at distributions over values across different members of the ensemble such as inverse eigenvalue. We can show a bar above this with individual members labelled like in

Krieger J, Bahar I, Greger IH. Structure, Dynamics, and Allosteric Potential of Ionotropic Glutamate Receptor N-Terminal Domains. Biophys. J. 2015 109(6):1136-48.

In this automated version, the bar is coloured from white to dark red depending on how many structures have values at that point.

We can select particular members to highlight with arrows by putting their names and labels in a dictionary:

In [48]:
highlights = {'3h5vA': 'GluA2','3o21C': 'GluA3',
              '3h6gA': 'GluK2', '3olzA': 'GluK3', 
              '5kc8A': 'GluD2'}

We plot the variance bar for the first five modes (showing a function of the inverse eigenvalues related to the resultant relative size of motion) above the inverse eigenvalue distributions for each of those modes. To arrange the plots like this, we use the GridSpec function of Matplotlib.

In [49]:
gs = GridSpec(ncols=1, nrows=2, height_ratios=[1, 10], hspace=0.15)

subplot(gs[0]);
showVarianceBar(gnms[:, :5], fraction=True, highlights=highlights);
xlabel('');

subplot(gs[1]);
showSignatureVariances(gnms[:, :5], fraction=True, bins=80, alpha=0.7);
xlabel('Fraction of inverse eigenvalue');

We can also extract the eigenvalues and eigenvectors directly from the mode ensemble and analyse them ourselves:

In [50]:
eigvals = gnms.getEigvals()
eigvals
Out[50]:
sdarray([[0.40332629, 1.09698756, 1.4466639 , ..., 5.30924955,
          5.5792069 , 5.77873354],
         [0.33585073, 1.37709037, 2.05877307, ..., 5.49544518,
          4.45554145, 6.00089776],
         [0.39625074, 1.03231662, 1.39041637, ..., 5.61184796,
          5.31761909, 5.20334578],
         ...,
         [0.39451721, 1.26321687, 1.50798271, ..., 5.76394814,
          5.45888474, 5.54173588],
         [0.38775641, 1.20703752, 1.59111616, ..., 4.77540064,
          5.60201866, 6.11600075],
         [0.39875231, 1.14861099, 1.58945168, ..., 5.73761851,
          5.68069539, 4.8536725 ]])
In [51]:
eigvecs = gnms.getEigvecs()
eigvecs
Out[51]:
sdarray([[[ 6.32823556e-02, -2.50192899e-02,  2.24757152e-02, ...,
           -2.75806462e-02, -2.91342858e-02,  2.35751313e-02],
          [ 6.14399172e-02, -2.55671762e-02,  1.82215962e-02, ...,
            2.37048870e-02,  1.30948891e-02,  8.00770496e-03],
          [ 6.03893156e-02, -2.75289674e-02,  1.47023932e-02, ...,
            6.27873120e-02,  4.50426791e-02,  7.48137751e-04],
          ...,
          [-3.45674867e-02,  5.36219952e-02, -8.18341574e-02, ...,
            3.73431044e-03, -1.56226528e-02,  1.48147066e-02],
          [-3.58459460e-02,  5.83491556e-02, -8.68632869e-02, ...,
            2.19044923e-04, -9.12061694e-03,  1.77561547e-02],
          [-3.86893737e-02,  6.78185627e-02, -1.09283894e-01, ...,
           -1.24293912e-01,  1.19842480e-01, -9.65913188e-02]],

         [[ 6.55399368e-02, -4.27319474e-02,  9.89071711e-02, ...,
           -9.53728657e-02, -3.43709792e-02,  1.39188570e-01],
          [ 6.35378511e-02, -3.84365552e-02,  7.20500291e-02, ...,
           -3.13283247e-02, -1.71957534e-03,  3.30300318e-02],
          [ 6.31173996e-02, -3.94593257e-02,  5.72773181e-02, ...,
            4.09686938e-02, -1.65581118e-02,  5.94580148e-02],
          ...,
          [-3.92495442e-02,  7.13035069e-02, -1.40226556e-01, ...,
            1.07809170e-02, -7.19132168e-04, -1.12456857e-02],
          [-4.14648492e-02,  8.89317211e-02, -2.03786500e-01, ...,
           -5.77904048e-03,  2.28664453e-02,  1.58767781e-03],
          [-4.30217016e-02,  9.95079546e-02, -2.52606805e-01, ...,
           -9.33489736e-02,  5.42239598e-02,  1.15549527e-02]],

         [[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
          [ 6.53380202e-02, -3.38623878e-02,  1.36489307e-02, ...,
           -7.68625718e-02, -8.80657489e-02,  2.61941773e-02],
          [ 6.49668875e-02, -3.76661574e-02,  1.31330580e-02, ...,
           -6.83373524e-02, -8.15130392e-02,  2.81657441e-02],
          ...,
          [-3.53740666e-02,  4.52002633e-02, -8.60796068e-02, ...,
           -8.76369561e-03,  5.79880419e-02,  2.61970647e-02],
          [-3.72555347e-02,  4.74531576e-02, -9.00106799e-02, ...,
            8.75617999e-03,  1.33878773e-01,  1.83615386e-02],
          [-3.97690638e-02,  5.20876285e-02, -1.00069887e-01, ...,
            6.47389907e-02,  4.30268520e-01,  2.50526966e-02]],

         ...,

         [[ 6.27972147e-02, -3.93116616e-02, -3.73235848e-03, ...,
           -9.05775754e-02, -2.75998215e-02, -2.90820022e-02],
          [ 6.33762550e-02, -2.87477573e-02,  3.61642084e-03, ...,
           -1.00128159e-01, -1.02858181e-01, -3.60300693e-02],
          [ 6.16665299e-02, -3.04506507e-02,  5.19394103e-03, ...,
           -3.76230760e-02, -3.03895864e-02, -4.36235997e-02],
          ...,
          [-4.02430288e-02,  3.74914833e-02, -9.74860558e-02, ...,
           -9.45730354e-03, -2.57004074e-02, -2.67623927e-02],
          [-4.04341629e-02,  4.18409508e-02, -9.47413386e-02, ...,
            9.63957399e-03, -2.74108989e-02, -1.32389184e-02],
          [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

         [[ 6.09276033e-02, -3.13783361e-02,  1.53718116e-02, ...,
            2.75040364e-02,  2.30271207e-02, -6.75812535e-02],
          [ 6.11083789e-02, -2.12443582e-02,  2.00598697e-02, ...,
            1.82373821e-03,  1.11641497e-02, -1.53660488e-01],
          [ 5.90401976e-02, -2.05543103e-02,  1.81249834e-02, ...,
            1.13954478e-02,  1.99273002e-02, -7.35295908e-02],
          ...,
          [-3.99169408e-02,  3.24769564e-02, -9.32559410e-02, ...,
            2.38863940e-02, -1.12087229e-01,  2.50907066e-02],
          [-4.03472009e-02,  3.67434488e-02, -8.76094095e-02, ...,
            7.70565759e-02, -2.67658215e-01,  4.01171547e-02],
          [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00]],

         [[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
            0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
          [ 6.26386283e-02, -2.51539903e-02,  4.06441675e-02, ...,
            1.24556612e-02,  9.82507745e-03,  1.48376078e-02],
          [ 6.10332991e-02, -2.65226588e-02,  3.23235560e-02, ...,
            5.70342641e-02,  9.77598491e-03, -7.11702459e-03],
          ...,
          [-3.48534987e-02,  4.07979825e-02, -1.02628475e-01, ...,
            6.61577028e-02, -4.15615160e-02,  8.79296801e-03],
          [-3.75181948e-02,  4.30152469e-02, -1.02249201e-01, ...,
            1.17856437e-01, -8.80188732e-02,  1.60250778e-02],
          [-3.79952007e-02,  4.69015048e-02, -1.00454973e-01, ...,
            2.30500578e-01, -1.89173790e-01,  2.17667557e-02]]])
weights=
array([[[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]],

       [[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]],

       ...,

       [[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        ...,
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.],
        [1., 1., 1., ..., 1., 1., 1.]]])

These are stored in instances of the sdarray class that we designed specifically for signature dynamics analysis. It is an extension of the standard NumPy ndarray but has additional attributes and some modified methods. The first axis is reserved for ensemble members and the mean, min, max and std are altered to average over this dimension rather than all dimensions.

We can look at the shape of these arrays and index them just like ndarray and ModeEnsemble objects. The eigenvalues are arranged in eigvals such that the first axis is the members and the second is the modes as in the mode ensemble.

In [52]:
eigvals.shape
Out[52]:
(70, 20)
In [53]:
eigvals[0:5,0:5]
Out[53]:
sdarray([[0.40332629, 1.09698756, 1.4466639 , 1.73630206, 1.99349677],
         [0.33585073, 1.37709037, 2.05877307, 1.5026893 , 1.95204566],
         [0.39625074, 1.03231662, 1.39041637, 1.6802333 , 1.74453408],
         [0.38475428, 1.20474748, 1.60459429, 1.91907853, 1.76988063],
         [0.39875231, 1.14861099, 1.58945168, 1.9900921 , 1.77448409]])

The eigenvectors are arranged in eigvecs such that the first axis is over the members, and the remaining dimensions are as in other eigenvector arrays - the second is over atoms and the third is mode index. Each atom has a weight, which varies between members and is important in calculating the mean, std, etc.

In [54]:
eigvecs.shape
Out[54]:
(70, 376, 20)

Step 4: Spectral overlap and distance

Spectral overlap, also known as covariance overlap, measures the overlap between two covariance matrices, or the overlap of a subset of the modes (a mode spectrum). This can also be converted into a distance using its arccosine as will be shown below.

We can calculate a matrix of spectral overlaps (so_matrix) over any slice of the ModeEnsemble that is still a mode ensemble itself, e.g.

In [55]:
so_matrix = calcEnsembleSpectralOverlaps(gnms[:, :1])
In [56]:
figure(figsize=(8,8))
showMatrix(so_matrix);

We can also obtain a spectral distance matrix (sd_matrix) from calcEnsembleSpectralOverlaps by giving it an additional argument:

In [57]:
sd_matrix = calcEnsembleSpectralOverlaps(gnms[:, :1], distance=True)
figure(figsize=(8,8)); showMatrix(sd_matrix);

We can then use this distance to calculate a tree. The labels from the mode ensemble as used as names for the leaves of the tree and are stored in their own variable/object for later use.

In [58]:
labels = dali_ens.getLabels()
so_tree = calcTree(names=labels, distance_matrix=sd_matrix, method='upgma')

We can show this tree using the function showTree:

In [59]:
showTree(so_tree);

We can also use this tree to reorder the so_matrix and obtain indices for reordering other objects:

In [60]:
reordered_so, new_so_indices = reorderMatrix(names=labels, matrix=so_matrix, tree=so_tree)
In [61]:
figure(figsize=(8,8))
showMatrix(reordered_so, ticklabels=new_so_indices);

As in the tree, we see 2-3 clusters with some finer structure within them as in the tree. These correspond to different subtypes of iGluRs called AMPA receptors (subunit paralogues GluA1-4, top) and kainate receptors (subunit paralogues GluK1-5, bottom) based on their preferred agonists as well as delta receptors at the bottom (these are flipped relative to the tree).

To show the matrix in the same order as the tree, we can add the option origin='upper':

In [62]:
figure(figsize=(8,8))
showMatrix(reordered_so, ticklabels=new_so_indices, origin='upper');

We can also show the tree along the y-axis of the matrix as follows:

In [63]:
figure(figsize=(11,8))
showMatrix(reordered_so, ticklabels=new_so_indices, origin='upper', 
           y_array=so_tree);

We can also use the resulting indices to reorder the ModeEnsemble and PDBEnsemble:

In [64]:
so_reordered_ens = dali_ens[new_so_indices]
so_reordered_gnms = gnms[new_so_indices, :]

Lists can only be used for indexing arrays not lists so we need to perform a type conversion prior to indexing in order to reorder the labels:

In [65]:
so_reordered_labels = np.array(labels)[new_so_indices]

Comparing with sequence and structural distances

The sequence distance is given by the (normalized) Hamming distance, which is calculated by subtracting the percentage identity (fraction) from 1, and the structural distance is the RMSD. We can also calculate and show the matrices and trees for these from the PDB ensemble.

First we calculate the sequence distance matrix:

In [66]:
seqid_matrix = buildSeqidMatrix(so_reordered_ens.getMSA())
seqdist_matrix = 1. - seqid_matrix
@> Sequence identity matrix was calculated in 0.00s.
In [67]:
figure(figsize=(8,8));
showMatrix(seqdist_matrix);

We can also construct a tree based on seqdist_matrix and use that to reorder it:

In [68]:
seqdist_tree = calcTree(names=so_reordered_labels, distance_matrix=seqdist_matrix, method='upgma')
showTree(seqdist_tree);

We can reorder seqdist_matrix with seqdist_tree as we did above with so_tree:

In [69]:
reordered_seqdist_seqdist, new_seqdist_indices = reorderMatrix(names=so_reordered_labels, 
                                                               matrix=seqdist_matrix, tree=seqdist_tree)
figure(figsize=(8,8));
showMatrix(reordered_seqdist_seqdist, ticklabels=new_seqdist_indices);

This shows us even clearer groups than the dynamic spectrum-based analysis. We see one subunit by itself at the bottom that is from a delta-type iGluR (GluD2), then two groups of kainate receptors (GluK5 and GluK2 with GluK3), and four groups of AMPARs (GluA1, GluA2, GluA4, and many structures from GluA3).

Similarily, once we obtain the RMSD matrix and tree using the getRMSDs method of the PDBEnsemble, we can calculate the structure-based tree:

In [70]:
rmsd_matrix = so_reordered_ens.getRMSDs(pairwise=True)
figure(figsize=(8,8)); showMatrix(rmsd_matrix);
In [71]:
rmsd_tree = calcTree(names=so_reordered_labels, 
                     distance_matrix=rmsd_matrix, 
                     method='upgma')

It could be of interest to put all three trees constructed based on different distance metrics side by side and compare them. We can do this using the subplot function from Matplotlib.

In [72]:
figure(figsize=(20,8));
subplot(1, 3, 1);
showTree(seqdist_tree, format='plt');
title('Sequence');
subplot(1, 3, 2);
showTree(rmsd_tree, format='plt');
title('Structure');
subplot(1, 3, 3);
showTree(so_tree, format='plt');
title('Dynamics');

Likewise, we can place the matrices side-by-side after having them all reordered the same way. We'll reorder by seqdist in this example:

In [73]:
reordered_rmsd_seqdist, new_seqdist_indices = reorderMatrix(names=so_reordered_labels, 
                                                            matrix=rmsd_matrix, tree=seqdist_tree)
reordered_sd_seqdist, new_seqdist_indices = reorderMatrix(names=so_reordered_labels, 
                                                          matrix=sd_matrix, tree=seqdist_tree)
In [74]:
figure(figsize=(20,8));
subplot(1, 3, 1);
showMatrix(reordered_seqdist_seqdist, ticklabels=new_seqdist_indices, origin='upper');
title('Sequence');
subplot(1, 3, 2);
showMatrix(reordered_rmsd_seqdist, ticklabels=new_seqdist_indices, origin='upper');
title('Structure');
subplot(1, 3, 3);
showMatrix(reordered_sd_seqdist, ticklabels=new_seqdist_indices, origin='upper');
title('Dynamics');

This analysis is quite sensitive to how many modes are used. As the number of modes approaches the full number, the dynamic distance order approaches the RMSD order. With smaller numbers, we see finer distinctions and there is a point where the dynamic distances are more in line with the sequence distances, which we call the low-to-intermediate frequency regime. In the current case where we used just one global mode (with the lowest frequency), we see small spectral distances but some subfamily differentiation is still apparent.

The same analysis could also be performed with a larger ensemble by selecting lower sequence identity and Z-score cutoffs as we did in our paper.

Now we have finished this tutorial, we reset the default path to the PDB folder, so that we aren't surprised next time we download PDBs and can't find them:

In [75]:
pathPDBFolder('')
@> PDB folder 'C:\\Users\\burak\\Desktop\\wormat\\pdbs' is released.
In [ ]: