Homologous Proteins

This example shows how to perform PCA of a structural dataset obtained by BLAST searching the PDB. The protein of interest is cytochrome c (cyt c). This dataset will contain structures sharing 44% or more sequence identity with human cyt c, i.e. its paralogs and/or orthologs.

A PCA instance that stores the covariance matrix and principal modes that describe the dominant changes in the dataset will be obtained. PCA instance and principal modes (Mode) can be used as input to functions in dynamics module for further analysis.

Input is amino acid sequence of the protein, a reference PDB identifier, and some parameters.


Import ProDy and matplotlib into the current namespace.

In [1]: from prody import *

In [2]: from pylab import *

In [3]: ion()

Name of the protein (a name without a white space is preferred)

In [4]: name = 'cyt_c'

In [5]: ref_pdb = '1hrc'

In order to perform a BLAST search of the PDB, we will need the amino acid sequence of our reference protein. We could get the FASTA format from the PDB, or we could get the sequence from the PDB file itself. A more attractive method (to us) is to get the sequence using ProDy.

In [6]: ref_prot = parsePDB(ref_pdb)

In [7]: ref_hv = ref_prot.getHierView()['A']

In [8]: sequence = ref_hv.getSequence()

This is the same as simply using


Optionally, a list of PDB files to be excluded from analysis can be provided. In this case dimeric Cyt c structures are excluded from the analysis. To use all PDB hits, provide an empty list.

In [10]: exclude = ['3nbt', '3nbs']


It is sometimes useful to set parameters in variables to use multiple times. In this case, we use seqid for the minimum sequence identity for including sequences at both selection of BLAST hits and ensemble building.

In [11]: seqid = 44

Blast and download

A list of PDB structures can be obtained using blastPDB() as follows:

In [12]: blast_record = blastPDB(sequence)

If this function times out, then you can ask the blast_record to try again using the PDBBlastRecord.fetch(). We can even do this in a loop to be sure:

In [13]: while not blast_record.isSuccess:
   ....:    blast_record.fetch()

It is a good practice to save this record on disk, as NCBI may not respond to repeated searches for the same sequence. We can do this using the Python standard library pickle as follows:

In [14]: import pickle

The record is saved using the dump() function:

In [15]: pickle.dump(blast_record, open('cytc_blast_record.pkl', 'wb'))

Then, it can be loaded using the load() function:

In [16]: blast_record = pickle.load(open('cytc_blast_record.pkl', 'rb'))

We then read information from the record to extract a list of PDB IDs and chain IDs.

In [17]: pdb_hits = []

In [18]: for key, item in blast_record.getHits(seqid).iteritems():
   ....:     pdb_hits.append((key, item['chain_id']))

Let’s parse the PDB files and see how many there are:

In [19]: pdbs = parsePDB([pdb for pdb, ch in pdb_hits], subset='ca', compressed=False)
In [20]: len(pdbs)
Out[20]: 120

Set reference

We first parse the reference structure. Note that we parse only Cα atoms from chain A. The analysis will be performed for a single chain (monomeric) protein. For analysis of a dimeric protein see Multimeric Structures

In [21]: reference_structure = parsePDB(ref_pdb, subset='ca', chain='A')

In [22]: reference_hierview = reference_structure.getHierView()

In [23]: reference_chain = reference_hierview['A']

Prepare ensemble

X-ray structural ensembles are heterogenenous, i.e. different structures have different sets of unresolved residues. Hence, it is not straightforward to analyzed them as it would be for NMR models (see NMR Models).

ProDy has special functions and classes for facilitating efficient analysis of the PDB X-ray data. In this example we use mapOntoChain() function which returns an AtomMap instance. See How AtomMap works for more details.

The resulting AtomMap instances are used to prepare a PDBEnsemble by mapping each structure against the reference chain and adding a coordinates set corresponding to the mapped atoms. The overall procedure is shown in detail below so you can understand the process and think about case specific changes such as those in the Multimeric Structures tutorial. This process can also be automated using buildPDBEnsemble() as shown in the Heterogeneous X-ray Structures tutorial.

In [24]: startLogfile('pca_blast')

In [25]: ensemble = PDBEnsemble(name)

In [26]: ensemble.setAtoms(reference_chain)

In [27]: ensemble.setCoords(reference_chain.getCoords())
In [28]: for structure in pdbs:
   ....:     if structure.getTitle()[:4] in exclude:
   ....:         continue
   ....:     if structure is None:
   ....:         plog('Failed to parse ' + pdb_file)
   ....:         continue
   ....:     mappings = mapOntoChain(structure, reference_chain, seqid=seqid)
   ....:     if len(mappings) == 0:
   ....:         plog('Failed to map', structure.getTitle()[:4])
   ....:         continue
   ....:     atommap = mappings[0][0]
   ....:     ensemble.addCoordset(atommap, weights=atommap.getFlags('mapped'))

In [29]: ensemble.iterpose()

In [30]: saveEnsemble(ensemble)
Out[30]: 'cyt_c.ens.npz'

Let’s check how many conformations are extracted from PDB files:

In [31]: len(ensemble)
Out[31]: 407

Note that the number of conformations is larger than the number of PDB structures we retrieved. This is because some of the PDB files contained NMR structures with multiple models. Each model in NMR structures are added to the ensemble as individual conformations.

Write aligned conformations into a PDB file as follows:

In [32]: writePDB(name+'.pdb', ensemble)
Out[32]: 'cyt_c.pdb'

This file can be used to visualize the aligned conformations in a modeling software.

Align PDB files

alignByEnsemble() function can be used to align PDB structures used in the analysis from which you can write new PDB files. The resulting files will contain intact structures and can be used for visualization purposes. In this case, we will align only select PDB files:

In [33]: conf1_aligned, conf2_aligned = alignByEnsemble(pdbs[:2], ensemble[:2])

Let’s take a quick look at the aligned structures:

In [34]: showProtein(conf1_aligned, conf2_aligned);

In [35]: legend();

Perform PCA

Once the ensemble is ready, performing PCA is 3 easy steps:

In [36]: pca = PCA(name)

In [37]: pca.buildCovariance(ensemble)

In [38]: pca.calcModes()

The calculated data can be saved as a compressed file using saveModel() function:

In [39]: saveModel(pca)
Out[39]: 'cyt_c.pca.npz'

Plot results

Let’s plot RMSDs of all conformations from the average conformation:

In [40]: rmsd = calcRMSD(ensemble)

In [41]: plot(rmsd);

In [42]: xlabel('Conformation index');

In [43]: ylabel('RMSD (A)');

Let’s show a projection of the ensemble onto PC1 and PC2:

In [44]: showProjection(ensemble, pca[:2]);