Homologous Proteins¶
This example shows how to perform PCA of a structural dataset obtained by BLAST searching the PDB. The protein of interest is cytochrome c (cyt c). This dataset will contain structures sharing 44% or more sequence identity with human cyt c, i.e. its paralogs and/or orthologs.
A PCA
instance that stores the covariance matrix and principal modes that
describe the dominant changes in the dataset will be obtained. PCA
instance and principal modes (Mode
) can be used as input to functions
in dynamics
module for further analysis.
Input is amino acid sequence of the protein, a reference PDB identifier, and some parameters.
Setup¶
Import ProDy and matplotlib into the current namespace.
In [1]: from prody import *
In [2]: from pylab import *
In [3]: ion()
Name of the protein (a name without a white space is preferred)
In [4]: name = 'cyt_c'
In [5]: ref_pdb = '1hrc'
In order to perform a BLAST search of the PDB, we will need the amino acid sequence of our reference protein. We could get the FASTA format from the PDB, or we could get the sequence from the PDB file itself. A more attractive method (to us) is to get the sequence using ProDy.
In [6]: ref_prot = parsePDB(ref_pdb)
In [7]: ref_hv = ref_prot.getHierView()['A']
In [8]: sequence = ref_hv.getSequence()
This is the same as simply using
In [9]: sequence = '''GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFTYTDANKNKGITWKEE
...: TLMEYLENPKKYIPGTKMIFAGIKKKTEREDLIAYLKKATNE'''
...:
Optionally, a list of PDB files to be excluded from analysis can be provided. In this case dimeric Cyt c structures are excluded from the analysis. To use all PDB hits, provide an empty list.
In [10]: exclude = ['3nbt', '3nbs']
Parameters¶
It is sometimes useful to set parameters in variables to use multiple times.
In this case, we use seqid
for the minimum sequence identity for including
sequences at both selection of BLAST hits and ensemble building.
In [11]: seqid = 44
Blast and download¶
A list of PDB structures can be obtained using blastPDB()
as follows:
In [12]: blast_record = blastPDB(sequence)
If this function times out, then you can ask the blast_record to try again
using the PDBBlastRecord.fetch()
. We can even do this in a loop to be sure:
In [13]: while not blast_record.isSuccess:
....: blast_record.fetch()
....:
It is a good practice to save this record on disk, as NCBI may not respond to
repeated searches for the same sequence. We can do this using the Python standard
library pickle
as follows:
In [14]: import pickle
The record is saved using the dump()
function:
In [15]: pickle.dump(blast_record, open('cytc_blast_record.pkl', 'wb'))
Then, it can be loaded using the load()
function:
In [16]: blast_record = pickle.load(open('cytc_blast_record.pkl', 'rb'))
We then read information from the record to extract a list of PDB IDs and chain IDs.
In [17]: pdb_hits = []
In [18]: for key, item in blast_record.getHits(seqid).iteritems():
....: pdb_hits.append((key, item['chain_id']))
....:
Let’s parse the PDB files and see how many there are:
In [19]: pdbs = parsePDB([pdb for pdb, ch in pdb_hits], subset='ca', compressed=False)
In [20]: len(pdbs)
Out[20]: 120
Set reference¶
We first parse the reference structure. Note that we parse only Cα atoms from chain A. The analysis will be performed for a single chain (monomeric) protein. For analysis of a dimeric protein see Multimeric Structures
In [21]: reference_structure = parsePDB(ref_pdb, subset='ca', chain='A')
In [22]: reference_hierview = reference_structure.getHierView()
In [23]: reference_chain = reference_hierview['A']
Prepare ensemble¶
X-ray structural ensembles are heterogenenous, i.e. different structures have different sets of unresolved residues. Hence, it is not straightforward to analyzed them as it would be for NMR models (see NMR Models).
ProDy has special functions and classes for facilitating efficient analysis
of the PDB X-ray data. In this example we use mapOntoChain()
function which returns an AtomMap
instance. See How AtomMap works for more details.
The resulting AtomMap
instances are used to prepare a PDBEnsemble
by mapping each structure against the reference chain and adding a coordinates set
corresponding to the mapped atoms. The overall procedure is shown in detail below
so you can understand the process and think about case specific changes such as those in
the Multimeric Structures tutorial. This process can also be automated using
buildPDBEnsemble()
as shown in the Heterogeneous X-ray Structures tutorial.
In [24]: startLogfile('pca_blast')
In [25]: ensemble = PDBEnsemble(name)
In [26]: ensemble.setAtoms(reference_chain)
In [27]: ensemble.setCoords(reference_chain.getCoords())
In [28]: for structure in pdbs:
....: if structure.getTitle()[:4] in exclude:
....: continue
....: if structure is None:
....: plog('Failed to parse ' + pdb_file)
....: continue
....: mappings = mapOntoChain(structure, reference_chain, seqid=seqid)
....: if len(mappings) == 0:
....: plog('Failed to map', structure.getTitle()[:4])
....: continue
....: atommap = mappings[0][0]
....: ensemble.addCoordset(atommap, weights=atommap.getFlags('mapped'))
....:
In [29]: ensemble.iterpose()
In [30]: saveEnsemble(ensemble)
Out[30]: 'cyt_c.ens.npz'
Let’s check how many conformations are extracted from PDB files:
In [31]: len(ensemble)
Out[31]: 407
Note that the number of conformations is larger than the number of PDB structures we retrieved. This is because some of the PDB files contained NMR structures with multiple models. Each model in NMR structures are added to the ensemble as individual conformations.
Write aligned conformations into a PDB file as follows:
In [32]: writePDB(name+'.pdb', ensemble)
Out[32]: 'cyt_c.pdb'
This file can be used to visualize the aligned conformations in a modeling software.
Align PDB files¶
alignByEnsemble()
function can be used to align PDB structures used
in the analysis from which you can write new PDB files.
The resulting files will contain intact structures and can be used for
visualization purposes. In this case, we will align only select PDB files:
In [33]: conf1_aligned, conf2_aligned = alignByEnsemble(pdbs[:2], ensemble[:2])
Let’s take a quick look at the aligned structures:
In [34]: showProtein(conf1_aligned, conf2_aligned);
In [35]: legend();
Perform PCA¶
Once the ensemble is ready, performing PCA is 3 easy steps:
In [36]: pca = PCA(name)
In [37]: pca.buildCovariance(ensemble)
In [38]: pca.calcModes()
The calculated data can be saved as a compressed file using saveModel()
function:
In [39]: saveModel(pca)
Out[39]: 'cyt_c.pca.npz'
Plot results¶
Let’s plot RMSDs of all conformations from the average conformation:
In [40]: rmsd = calcRMSD(ensemble)
In [41]: plot(rmsd);
In [42]: xlabel('Conformation index');
In [43]: ylabel('RMSD (A)');
Let’s show a projection of the ensemble onto PC1 and PC2:
In [44]: showProjection(ensemble, pca[:2]);