This example compares experimental structural data analyzed using Principal Component Analysis (PCA) with the theoretical data predicted by Anisotropic Network Model (ANM):

First make the necessary imports:

In [1]:

```
from prody import *
from numpy import *
from matplotlib.pyplot import *
```

In [2]:

```
prody.__version__
```

Out[2]:

'2.0'

One way to retrieve data is to run an NCBI BLAST search against the PDB with the function blastPDB.

To do this, we first need to obtain a sequence and one way to do that is from the PDB:

In [3]:

```
p38 = parsePDB('1p38', compressed=False) # MAP KINASE
```

We just want one sequence, so we get the sequence of chain A from the pdb file

In [4]:

```
p38_sequence = p38['A'].getSequence()
p38_sequence
```

Out[4]:

'ERPTFYRQELNKTIWEVPERYQNLSPVGSGAYGSVCAAFDTKTGHRVAVKKLSRPFQSIIHAKRTYRELRLLKHMKHENVIGLLDVFTPARSLEEFNDVYLVTHLMGADLNNIVKCQKLTDDHVQFLIYQILRGLKYIHSADIIHRDLKPSNLAVNEDCELKILDFGLARHTDDEMTGYVATRWYRAPEIMLNWMHYNQTVDIWSVGCIMAELLTGRTLFPGTDHIDQLKLILRLVGTPGAELLKKISSESARNYIQSLAQMPKMNFANVFIGANPLAVDLLEKMLVLDSDKRITAAQALAHAYFAQYHDPDDEPVADPYDQSFESRDLLIDEWKSLTYDEVISFVPPPLD'

Once we have this sequence, we can use it in the function `blastPDB`

. We also provide a filename to save the output so we don't need to run it again. To reduce demand on the NCBI webserver, we have provided you this file so please do not run this command.

Instead, please load the data from the pickle file provided:

In [5]:

```
import pickle
blast_record = pickle.load(open('blast_record3.pkl', 'rb'))
```

In [6]:

```
blast_record
```

Out[6]:

<prody.proteins.blastpdb.PDBBlastRecord at 0x7fb8fbc71100>

We can get hits from this record using certain parameters to filter them and extract a list of PDB IDs from them.

In [7]:

```
hits = blast_record.getHits(percent_identity=90, percent_overlap=70)
```

In [8]:

```
pdbids = list(hits.keys())
len(pdbids)
```

Out[8]:

253

In [9]:

```
pdbids[:10]
```

Out[9]:

['1lew', '1lez', '4ka3', '4loo', '4lop', '4loq', '5lar', '3tg1', '1bmk', '5uoj']

Next, we will use the `parsePDB`

function to import each one of the structures corresponding to these IDs.

Before doing that, we will make a folder to put them in (if it doesn't already exist) and configure *ProDy* to use that folder with the function `pathPDBFolder`

.

In [10]:

```
from os import mkdir
from os.path import isdir
if not isdir('pdbs'):
mkdir('pdbs')
```

In [11]:

```
pathPDBFolder('pdbs')
```

In [12]:

```
pdbs = parsePDB(pdbids)
```

@> 253 PDBs were parsed in 34.10s.

After parsing the structures from the PDB, we can use this function again to reset the default download folder back to our current directory:

In [13]:

```
pathPDBFolder('')
```

@> PDB folder '/Users/bentley/Dropbox/Pitt/Bahar/MMBioS/2021/notebooks/pdbs' is released.

Next, we make a selection to use as the reference for ensemble building:

In [14]:

```
ref_structure = p38
ref_selection = ref_structure.select('resnum 5 to 31 36 to 114 122 to '
'169 185 to 351 and calpha')
```

We extract chain A by indexing to get a `Chain`

object to make things easier.

In [15]:

```
ref_chain = ref_selection['A']
repr(ref_chain)
```

Out[15]:

'<Chain: A from 1p38 (321 residues, 321 atoms)>'

We will prepare a `PDBEnsemble`

by mapping each structure against the reference chain and adding a coordinates set corresponding to the mapped atoms. We first make sure that our list of PDB structures (**pdbs**) includes the **ref_chain**.

In [16]:

```
pdbs.insert(0, ref_chain)
```

In [20]:

```
ensemble = buildPDBEnsemble(pdbs, mapping='seq', ref=ref_chain, title='p38')
```

@> Mapping 4geo to the reference... [ 99%] 1s

In [21]:

```
ensemble
```

Out[21]:

<PDBEnsemble: p38 (278 conformations; 321 atoms)>

Now we will examine the structural dynamics of this ensemble using two different methods

PCA is a method that identifies the components which account for the greatest amount of variability in your dataset, i.e. ensemble.

In [22]:

```
pca = PCA('p38 xray') # Instantiate a PCA instance
pca.buildCovariance(ensemble) # Build covariance for the ensemble
pca.calcModes() # Calculate modes (20 of the by default)
```

The components/modes of variation are sorted such that the first modes contribute the greatest fractional variance, which we can show as follows:

In [23]:

```
for mode in pca[:5]: # Print % variance explained by top PCs
var = calcFractVariance(mode)*100
print('{0:s} % variance = {1:.2f}'.format(repr(mode), var))
```

The first modes with the highest fractional variance are called the principal components (PCs).

The ANM allows for the identification of the most impactful (slowest) modes in dynamics of a single protein, which we can compare to the principal components from PCA.

In [24]:

```
anm = ANM('1p38') # Instantiate a ANM instance
anm.buildHessian(ref_chain) # Build Hessian for the reference chain
anm.calcModes() # Calculate slowest non-trivial 20 modes
```

One property that we can calculate and compare is the collectivity, which describes the extent to which a mode collectively recruits large portions of the structure. We see that most of the first modes from both calculations are highly collective.

In [25]:

```
for mode in pca[:3]: # Print PCA mode collectivity
coll = calcCollectivity(mode)
print('{0:s} collectivity = {1:.2f}'.format(repr(mode), coll))
```

In [26]:

```
for mode in anm[:3]: # Print ANM mode collectivity
coll = calcCollectivity(mode)
print('{0:s} collectivity = {1:.2f}'.format(repr(mode), coll))
```

We can also look at how well the modes produced from each method correlate with each other using the overlap (correlation cosine).

In [27]:

```
printOverlapTable(pca[:3], anm[:3]) # Top 3 PCs vs slowest 3 ANM modes
```

In [28]:

```
showOverlapTable(pca[:6], anm[:6]);
title('PCA - ANM Overlap Table');
```

The overlap table shows how each mode from the two methods overlap with each other. Some modes overlap very well with one other mode while others overlap with multiple modes to a lesser extent.

We can also look at the overlap between one mode and all others as follows. The cumulative overlap is the square root of the sum of squared overlaps.

In [29]:

```
showOverlap(pca[0], anm);
showCumulOverlap(pca[0], anm, color='r');
```

In order to see where in the protein these important motions occur, we can visualize the square fluctuations of the principal components and/or slow ANM modes as follows. The function `showScaledSqFlucts`

allows us to scale the square fluctuations from each set of modes to have the same overall size for easier comparison.

We can apply this to individual modes, such as overlapping mode 1 (index 0) from each method, or multiple modes such as the first 3.

In [30]:

```
showScaledSqFlucts(pca[1], anm[1]);
legend();
```

In [31]:

```
showScaledSqFlucts(pca[3], anm[1]);
legend();
```

We can also see how correlated the motions for each residue are with each other residue. We see similar patterns for the two methods, especially when using a large number of modes.

In [32]:

```
showCrossCorr(pca[0]);
```

In [33]:

```
showCrossCorr(anm[0]);
```

We can save the outputs from each step for loading back into *ProDy* as follows:

In [34]:

```
writePDB('p38_ref_chain.pdb', ref_chain)
saveEnsemble(ensemble)
saveModel(pca)
saveModel(anm)
```

Out[34]:

'1p38.anm.npz'

We can also prepare outputs in NMD format for visualising in VMD with the normal mode wizard:

In [35]:

```
writeNMD('p38_pca.nmd',anm,ref_chain)
```

Out[35]:

'p38_pca.nmd'

In [36]:

```
writeNMD('p38_anm.nmd',pca,ref_chain)
```

Out[36]:

'p38_anm.nmd'

In [ ]:

```
```

In [ ]:

```
```