Ensemble Analysis

This example compares experimental structural data analyzed using Principal Component Analysis (PCA) with the theoretical data predicted by Anisotropic Network Model (ANM):

First make the necessary imports:

In [1]:
from prody import *
from numpy import *
from matplotlib.pyplot import *

Retrieve dataset

One way to retrieve data is to run an NCBI BLAST search against the PDB with the function blastPDB.

To do this, we first need to obtain a sequence and one way to do that is from the PDB:

In [2]:
p38 = parsePDB('1p38', compressed=False) # MAP KINASE
@> Connecting wwPDB FTP server RCSB PDB (USA).
@> pdb1p38.ent.gz download failed. 1p38 does not exist on ftp.wwpdb.org.
@> PDB download via FTP completed (0 downloaded, 1 failed).
@> Downloading PDB files via FTP failed, trying HTTP.
@> 1p38 downloaded (1p38.pdb)
@> PDB download via HTTP completed (1 downloaded, 0 failed).
@> 2962 atoms and 1 coordinate set(s) were parsed in 0.10s.
@> Secondary structures were assigned to 188 residues.

We just want one sequence, so we get the sequence of chain A from the pdb file

In [3]:
p38_sequence = p38['A'].getSequence()

Once we have this sequence, we can use it in the function blastPDB. We also provide a filename to save the output so we don't need to run it again. To reduce demand on the NCBI webserver, we have provided you this file so please do not run this command.

In [ ]:
# blast_record = blastPDB(p38_sequence)

Instead, please load the data from the pickle file provided:

In [4]:
import pickle

blast_record = pickle.load(open('blast_record3.pkl', 'rb'))
In [5]:
<prody.proteins.blastpdb.PDBBlastRecord at 0x20a47ef5a30>

We can get hits from this record using certain parameters to filter them and extract a list of PDB IDs from them.

In [6]:
hits = blast_record.getHits(percent_identity=90, percent_overlap=70)
In [7]:
pdbids = list(hits.keys())

Next, we will use the parsePDB function to import each one of the structures corresponding to these IDs.

Before doing that, we will make a folder to put them in (if it doesn't already exist) and configure ProDy to use that folder with the function pathPDBFolder.

In [8]:
from os import mkdir
from os.path import isdir

if not isdir('pdbs'):
In [9]:
@> Local PDB folder is set: 'C:\\Users\\burak\\Desktop\\wormat\\pdbs'
@> A plain folder structure will be assumed.
In [10]:
pdbs = parsePDB(pdbids)
@> 253 PDBs were parsed in 163.99s.

After parsing the structures from the PDB, we can use this function again to reset the default download folder back to our current directory:

In [11]:
@> PDB folder 'C:\\Users\\burak\\Desktop\\wormat\\pdbs' is released.

Set reference chain

Next, we make a selection to use as the reference for ensemble building:

In [12]:
ref_structure = p38
ref_selection = ref_structure.select('resnum 5 to 31 36 to 114 122 to '
                                     '169 185 to 351 and calpha')

We extract chain A by indexing to get a Chain object to make things easier.

In [13]:
ref_chain = ref_selection['A']
'<Chain: A from 1p38 (321 residues, 321 atoms)>'

Ensemble Preparation

We will prepare a PDBEnsemble by mapping each structure against the reference chain and adding a coordinates set corresponding to the mapped atoms. We first make sure that our list of PDB structures (pdbs) includes the ref_chain.

In [14]:
pdbs.insert(0, ref_chain)
In [15]:
ensemble = buildPDBEnsemble(pdbs, ref=ref_chain, title='p38')
@> Mapping 1lew to the reference... [  0%]@> WARNING cealign could not align <SimpleChain: Chain A from 1p38 with 321 residues> and <SimpleChain: Chain B from 1lew with 10 residues>
@> Mapping 1lez to the reference... [  0%]@> WARNING cealign could not align <SimpleChain: Chain A from 1p38 with 321 residues> and <SimpleChain: Chain B from 1lez with 8 residues>
@> Mapping 4ka3 to the reference... [  1%]@> WARNING cealign could not align <SimpleChain: Chain A from 1p38 with 321 residues> and <SimpleChain: Chain B from 4ka3 with 8 residues>
@> Mapping 4loq to the reference... [  2%]@> WARNING cealign could not align <SimpleChain: Chain A from 1p38 with 321 residues> and <SimpleChain: Chain N from 4loq with 18 residues>
@> Mapping 3p4k to the reference... [  8%] 38s@> WARNING cealign could not align <SimpleChain: Chain A from 1p38 with 321 residues> and <SimpleChain: Chain P from 3p4k with 11 residues>
@> Mapping 5eta to the reference... [ 40%] 16s@> WARNING cealign could not align <SimpleChain: Chain A from 1p38 with 321 residues> and <SimpleChain: Chain D from 5eta with 16 residues>
@> WARNING cealign could not align <SimpleChain: Chain A from 1p38 with 321 residues> and <SimpleChain: Chain C from 5eta with 16 residues>
@> Mapping 5etf to the reference... [ 75%] 6s @> WARNING cealign could not align <SimpleChain: Chain A from 1p38 with 321 residues> and <SimpleChain: Chain B from 5etf with 8 residues>
@> Mapping 2y8o to the reference... [ 82%] 4s@> WARNING cealign could not align <SimpleChain: Chain A from 1p38 with 321 residues> and <SimpleChain: Chain B from 2y8o with 8 residues>
@> Starting iterative superposition:         
@> Step #1: RMSD difference = 8.6391e-01
@> Step #2: RMSD difference = 3.5127e-04
@> Step #3: RMSD difference = 4.9810e-07
@> Iterative superposition completed in 0.29s.
@> Final superposition to calculate transformations.
@> Superposition completed in 0.11 seconds.
@> Ensemble (278 conformations) were built in 21.94s.
In [16]:
<PDBEnsemble: p38 (278 conformations; 321 atoms)>

Ensemble Dynamics

Now we will examine the structural dynamics of this ensemble using two different methods

1. Principal Component Analysis (PCA)

PCA is a method that identifies the components which account for the greatest amount of variability in your dataset, i.e. ensemble.

In [17]:
pca = PCA('p38 xray')           # Instantiate a PCA instance
pca.buildCovariance(ensemble)   # Build covariance for the ensemble
pca.calcModes()                 # Calculate modes (20 of the by default)
@> Covariance is calculated using 278 coordinate sets.
@> Covariance matrix calculated in 0.097548s.
@> 20 modes were calculated in 0.27s.

The components/modes of variation are sorted such that the first modes contribute the greatest fractional variance, which we can show as follows:

In [18]:
for mode in pca[:5]:    # Print % variance explained by top PCs
    var = calcFractVariance(mode)*100
    print('{0:s}  % variance = {1:.2f}'.format(repr(mode), var))
<Mode: 1 from PCA p38 xray>  % variance = 23.32
<Mode: 2 from PCA p38 xray>  % variance = 15.79
<Mode: 3 from PCA p38 xray>  % variance = 14.17
<Mode: 4 from PCA p38 xray>  % variance = 10.19
<Mode: 5 from PCA p38 xray>  % variance = 6.50

The first modes with the highest fractional variance are called the principal components (PCs).

2. Anisotropic Network Model (ANM) Normal Mode Analysis (NMA)

The ANM allows for the identification of the most impactful (slowest) modes in dynamics of a single protein, which we can compare to the principal components from PCA.

In [19]:
anm = ANM('1p38')             # Instantiate a ANM instance
anm.buildHessian(ref_chain)   # Build Hessian for the reference chain
anm.calcModes()               # Calculate slowest non-trivial 20 modes
@> Hessian was built in 0.29s.
@> 20 modes were calculated in 1.10s.

Analysis of PCA and ANM modes

Collectivity of modes

One property that we can calculate and compare is the collectivity, which describes the extent to which a mode collectively recruits large portions of the structure. We see that most of the first modes from both calculations are highly collective.

In [20]:
for mode in pca[:3]:    # Print PCA mode collectivity
    coll = calcCollectivity(mode)
    print('{0:s}  collectivity = {1:.2f}'.format(repr(mode), coll))
<Mode: 1 from PCA p38 xray>  collectivity = 0.53
<Mode: 2 from PCA p38 xray>  collectivity = 0.18
<Mode: 3 from PCA p38 xray>  collectivity = 0.63
In [ ]:
for mode in anm[:3]:    # Print ANM mode collectivity
    coll = calcCollectivity(mode)
    print('{0:s}  collectivity = {1:.2f}'.format(repr(mode), coll))

PCA - ANM overlap

We can also look at how well the modes produced from each method correlate with each other using the overlap (correlation cosine).

In [21]:
printOverlapTable(pca[:3], anm[:3]) # Top 3 PCs vs slowest 3 ANM modes
Overlap Table
                        ANM 1p38
                    #1     #2     #3
PCA p38 xray #1   -0.78  -0.07  -0.47
PCA p38 xray #2   -0.26  +0.09  +0.46
PCA p38 xray #3   +0.38  +0.43  -0.37

In [22]:
showOverlapTable(pca[:6], anm[:6]);
title('PCA - ANM Overlap Table');

The overlap table shows how each mode from the two methods overlap with each other. Some modes overlap very well with one other mode while others overlap with multiple modes to a lesser extent.

We can also look at the overlap between one mode and all others as follows. The cumulative overlap is the square root of the sum of squared overlaps.

In [23]:
showOverlap(pca[0], anm);
showCumulOverlap(pca[0], anm, color='r');

Square Fluctuations

In order to see where in the protein these important motions occur, we can visualize the square fluctuations of the principal components and/or slow ANM modes as follows. The function showScaledSqFlucts allows us to scale the square fluctuations from each set of modes to have the same overall size for easier comparison.

We can apply this to individual modes, such as overlapping mode 1 (index 0) from each method, or multiple modes such as the first 3.

In [24]:
showScaledSqFlucts(pca[0], anm[0]);
In [25]:
showScaledSqFlucts(pca[3], anm[1]);

Cross Correlations

We can also see how correlated the motions for each residue are with each other residue. We see similar patterns for the two methods, especially when using a large number of modes.

In [26]:
In [27]:

Saving your work

We can save the outputs from each step for loading back into ProDy as follows:

In [28]:
writePDB('p38_ref_chain.pdb', ref_chain)

We can also prepare outputs in NMD format for visualising in VMD with the normal mode wizard:

In [29]:
In [30]:
In [ ]: