This example compares experimental structural data analyzed using Principal Component Analysis (PCA) with the theoretical data predicted by Anisotropic Network Model (ANM):
First make the necessary imports:
from prody import *
from numpy import *
from matplotlib.pyplot import *
prody.__version__
'2.0'
One way to retrieve data is to run an NCBI BLAST search against the PDB with the function blastPDB.
To do this, we first need to obtain a sequence and one way to do that is from the PDB:
p38 = parsePDB('1p38', compressed=False) # MAP KINASE
@> PDB file is found in working directory (1p38.pdb). @> 2962 atoms and 1 coordinate set(s) were parsed in 0.12s. @> Secondary structures were assigned to 188 residues.
We just want one sequence, so we get the sequence of chain A from the pdb file
p38_sequence = p38['A'].getSequence()
p38_sequence
'ERPTFYRQELNKTIWEVPERYQNLSPVGSGAYGSVCAAFDTKTGHRVAVKKLSRPFQSIIHAKRTYRELRLLKHMKHENVIGLLDVFTPARSLEEFNDVYLVTHLMGADLNNIVKCQKLTDDHVQFLIYQILRGLKYIHSADIIHRDLKPSNLAVNEDCELKILDFGLARHTDDEMTGYVATRWYRAPEIMLNWMHYNQTVDIWSVGCIMAELLTGRTLFPGTDHIDQLKLILRLVGTPGAELLKKISSESARNYIQSLAQMPKMNFANVFIGANPLAVDLLEKMLVLDSDKRITAAQALAHAYFAQYHDPDDEPVADPYDQSFESRDLLIDEWKSLTYDEVISFVPPPLD'
Once we have this sequence, we can use it in the function blastPDB
. We also provide a filename to save the output so we don't need to run it again. To reduce demand on the NCBI webserver, we have provided you this file so please do not run this command.
Instead, please load the data from the pickle file provided:
import pickle
blast_record = pickle.load(open('blast_record3.pkl', 'rb'))
blast_record
<prody.proteins.blastpdb.PDBBlastRecord at 0x7fb8fbc71100>
We can get hits from this record using certain parameters to filter them and extract a list of PDB IDs from them.
hits = blast_record.getHits(percent_identity=90, percent_overlap=70)
pdbids = list(hits.keys())
len(pdbids)
253
pdbids[:10]
['1lew', '1lez', '4ka3', '4loo', '4lop', '4loq', '5lar', '3tg1', '1bmk', '5uoj']
Next, we will use the parsePDB
function to import each one of the structures corresponding to these IDs.
Before doing that, we will make a folder to put them in (if it doesn't already exist) and configure ProDy to use that folder with the function pathPDBFolder
.
from os import mkdir
from os.path import isdir
if not isdir('pdbs'):
mkdir('pdbs')
pathPDBFolder('pdbs')
@> Local PDB folder is set: '/Users/bentley/Dropbox/Pitt/Bahar/MMBioS/2021/notebooks/pdbs' @> A plain folder structure will be assumed.
pdbs = parsePDB(pdbids)
@> 253 PDBs were parsed in 34.10s.
After parsing the structures from the PDB, we can use this function again to reset the default download folder back to our current directory:
pathPDBFolder('')
@> PDB folder '/Users/bentley/Dropbox/Pitt/Bahar/MMBioS/2021/notebooks/pdbs' is released.
Next, we make a selection to use as the reference for ensemble building:
ref_structure = p38
ref_selection = ref_structure.select('resnum 5 to 31 36 to 114 122 to '
'169 185 to 351 and calpha')
We extract chain A by indexing to get a Chain
object to make things easier.
ref_chain = ref_selection['A']
repr(ref_chain)
'<Chain: A from 1p38 (321 residues, 321 atoms)>'
We will prepare a PDBEnsemble
by mapping each structure against the reference chain and adding a coordinates set corresponding to the mapped atoms. We first make sure that our list of PDB structures (pdbs) includes the ref_chain.
pdbs.insert(0, ref_chain)
ensemble = buildPDBEnsemble(pdbs, mapping='seq', ref=ref_chain, title='p38')
@> Mapping 4geo to the reference... [ 99%] 1s
ensemble
<PDBEnsemble: p38 (278 conformations; 321 atoms)>
Now we will examine the structural dynamics of this ensemble using two different methods
PCA is a method that identifies the components which account for the greatest amount of variability in your dataset, i.e. ensemble.
pca = PCA('p38 xray') # Instantiate a PCA instance
pca.buildCovariance(ensemble) # Build covariance for the ensemble
pca.calcModes() # Calculate modes (20 of the by default)
The components/modes of variation are sorted such that the first modes contribute the greatest fractional variance, which we can show as follows:
for mode in pca[:5]: # Print % variance explained by top PCs
var = calcFractVariance(mode)*100
print('{0:s} % variance = {1:.2f}'.format(repr(mode), var))
<Mode: 1 from PCA p38 xray> % variance = 23.32 <Mode: 2 from PCA p38 xray> % variance = 15.79 <Mode: 3 from PCA p38 xray> % variance = 14.17 <Mode: 4 from PCA p38 xray> % variance = 10.19 <Mode: 5 from PCA p38 xray> % variance = 6.50
The first modes with the highest fractional variance are called the principal components (PCs).
The ANM allows for the identification of the most impactful (slowest) modes in dynamics of a single protein, which we can compare to the principal components from PCA.
anm = ANM('1p38') # Instantiate a ANM instance
anm.buildHessian(ref_chain) # Build Hessian for the reference chain
anm.calcModes() # Calculate slowest non-trivial 20 modes
One property that we can calculate and compare is the collectivity, which describes the extent to which a mode collectively recruits large portions of the structure. We see that most of the first modes from both calculations are highly collective.
for mode in pca[:3]: # Print PCA mode collectivity
coll = calcCollectivity(mode)
print('{0:s} collectivity = {1:.2f}'.format(repr(mode), coll))
<Mode: 1 from PCA p38 xray> collectivity = 0.53 <Mode: 2 from PCA p38 xray> collectivity = 0.18 <Mode: 3 from PCA p38 xray> collectivity = 0.63
for mode in anm[:3]: # Print ANM mode collectivity
coll = calcCollectivity(mode)
print('{0:s} collectivity = {1:.2f}'.format(repr(mode), coll))
<Mode: 1 from ANM 1p38> collectivity = 0.65 <Mode: 2 from ANM 1p38> collectivity = 0.55 <Mode: 3 from ANM 1p38> collectivity = 0.68
We can also look at how well the modes produced from each method correlate with each other using the overlap (correlation cosine).
printOverlapTable(pca[:3], anm[:3]) # Top 3 PCs vs slowest 3 ANM modes
Overlap Table ANM 1p38 #1 #2 #3 PCA p38 xray #1 -0.78 -0.07 -0.47 PCA p38 xray #2 -0.26 +0.09 +0.46 PCA p38 xray #3 +0.38 +0.43 -0.37
showOverlapTable(pca[:6], anm[:6]);
title('PCA - ANM Overlap Table');
The overlap table shows how each mode from the two methods overlap with each other. Some modes overlap very well with one other mode while others overlap with multiple modes to a lesser extent.
We can also look at the overlap between one mode and all others as follows. The cumulative overlap is the square root of the sum of squared overlaps.
showOverlap(pca[0], anm);
showCumulOverlap(pca[0], anm, color='r');
In order to see where in the protein these important motions occur, we can visualize the square fluctuations of the principal components and/or slow ANM modes as follows. The function showScaledSqFlucts
allows us to scale the square fluctuations from each set of modes to have the same overall size for easier comparison.
We can apply this to individual modes, such as overlapping mode 1 (index 0) from each method, or multiple modes such as the first 3.
showScaledSqFlucts(pca[1], anm[1]);
legend();
showScaledSqFlucts(pca[3], anm[1]);
legend();
We can also see how correlated the motions for each residue are with each other residue. We see similar patterns for the two methods, especially when using a large number of modes.
showCrossCorr(pca[0]);
showCrossCorr(anm[0]);
We can save the outputs from each step for loading back into ProDy as follows:
writePDB('p38_ref_chain.pdb', ref_chain)
saveEnsemble(ensemble)
saveModel(pca)
saveModel(anm)
'1p38.anm.npz'
We can also prepare outputs in NMD format for visualising in VMD with the normal mode wizard:
writeNMD('p38_pca.nmd',anm,ref_chain)
'p38_pca.nmd'
writeNMD('p38_anm.nmd',pca,ref_chain)
'p38_anm.nmd'