Homologous Proteins

This example shows how to perform PCA of a structural dataset obtained by blast searching PDB. The protein of interest is cytochrome c (cyt c). Dataset will contain structures sharing 44% or more sequence identity with human cyt c, i.e. its homologs and/or orthologs.

A PCA instance that stores covariance matrix and principal modes that describe the dominant changes in the dataset will be obtained. PCA instance and principal modes (Mode) can be used as input to functions in dynamics module for further analysis.

Input is amino acid sequence of the protein, a reference PDB identifier, and some parameters.


Import ProDy and matplotlib into the current namespace.

In [1]: from prody import *

In [2]: from pylab import *

In [3]: ion()

Name of the protein (a name without a white space is preferred)

In [4]: name = 'cyt_c'

In [5]: ref_pdb = '1hrc'

In order to perform a BLAST search of the PDB, we will need the amino acid sequence of our reference protein. We could get the FASTA format from the PDB, or we could get the sequence from the PDB file itself. A more attractive method (to us) is to get the sequence using ProDy.

In [6]: ref_prot = parsePDB(ref_pdb)

In [7]: ref_hv = ref_prot.getHierView()['A']

In [8]: sequence = ref_hv.getSequence()

This is the same as simply using


Optionally, a list of PDB files to be excluded from analysis can be provided. In this case dimeric Cyt c structures are excluded from the analysis. To use all PDB hits, provide an empty list.

In [10]: exclude = ['3nbt', '3nbs']


# Minimum sequence identity of hits
In [11]: seqid = 44

# Reference chain identifier
In [12]: ref_chid = 'A'
# Selection string ("all" can be used if all of the chain is to be analyzed)
In [13]: selstr = 'resnum 1 to 103'

Blast and download

List of PDB structures can be obtained using blastPDB() as follows:

In [14]: blast_record = blastPDB(sequence)

It is a good practice to save this record on disk, as NCBI may not respond to repeated searches for the same sequence. We can do this using Python standard library pickle as follows:

In [15]: import pickle

Record is save using dump() function into an open file:

In [16]: pickle.dump(blast_record, open('cytc_blast_record.pkl', 'w'))

Then, it can be loaded using load() function:

In [17]: blast_record = pickle.load(open('cytc_blast_record.pkl'))
In [18]: pdb_hits = []

In [19]: for key, item in blast_record.getHits(seqid).iteritems():
   ....:     pdb_hits.append((key, item['chain_id']))

Let’s fetch PDB files and see how many there are:

In [20]: pdb_files = fetchPDB(*[pdb for pdb, ch in pdb_hits], compressed=False)

In [21]: len(pdb_files)
Out[21]: 99

Set reference

We first parse the reference structure. Note that we parse only Cα atoms from chain A. The analysis will be performed for a single chain (monomeric) protein. For analysis of a dimeric protein see Multimeric Structures

In [22]: reference_structure = parsePDB(ref_pdb, subset='ca', chain=ref_chid)

# Get the reference chain from this structure
In [23]: reference_hierview = reference_structure.getHierView()

In [24]: reference_chain = reference_hierview[ref_chid]

Prepare ensemble

# Start a log file
In [25]: startLogfile('pca_blast')

# Instantiate a PDB ensemble
In [26]: ensemble = PDBEnsemble(name)

# Set ensemble atoms
In [27]: ensemble.setAtoms(reference_chain)

# Set reference coordinates
In [28]: ensemble.setCoords(reference_chain.getCoords())
In [29]: for (pdb_id, chain_id), pdb_file in zip(pdb_hits, pdb_files):
   ....:     if pdb_id in exclude:
   ....:         continue
   ....:     structure = parsePDB(pdb_file, subset='calpha', chain=chain_id)
   ....:     if structure is None:
   ....:         plog('Failed to parse ' + pdb_file)
   ....:         continue
   ....:     mappings = mapOntoChain(structure, reference_chain, seqid=seqid)
   ....:     if len(mappings) == 0:
   ....:         plog('Failed to map', pdb_id)
   ....:         continue
   ....:     atommap = mappings[0][0]
   ....:     ensemble.addCoordset(atommap, weights=atommap.getFlags('mapped'))

In [30]: ensemble.iterpose()

In [31]: saveEnsemble(ensemble)
Out[31]: 'cyt_c.ens.npz'

Let’s check how many conformations are extracted from PDB files:

In [32]: len(ensemble)
Out[32]: 405

Note that number of conformations is larger than the number of PDB structures we retrieved. This is because some of the PDB files contained NMR structures with multiple models. Each model in NMR structures are added to the ensemble as individual conformations.

Write aligned conformations into a PDB file as follows:

In [33]: writePDB(name+'.pdb', ensemble)
Out[33]: 'cyt_c.pdb'

This file can be used to visualize the aligned conformations in a modeling software.

Align PDB files

alignPDBEnsemble() function can be used to align all PDB structures used in the analysis, e.g. alignPDBEnsemble(ensemble). Outputted files will contain intact structures and can be used for visualization purposes in other software. In this case, we will align only select PDB files:

In [34]: conf1_aligned = alignPDBEnsemble(ensemble[0])

In [35]: conf2_aligned = alignPDBEnsemble(ensemble[1])

Let’s take a quick look at the aligned structures:

In [36]: showProtein(parsePDB(conf1_aligned), parsePDB(conf2_aligned));

In [37]: legend();

Perform PCA

Once the ensemble is ready, performing PCA is 3 easy steps:

# Instantiate a PCA
In [38]: pca = PCA(name)

# Build covariance matrix
In [39]: pca.buildCovariance(ensemble)

# Calculate modes
In [40]: pca.calcModes()

The calculated data can be saved as a compressed file using saveModel() function:

In [41]: saveModel(pca)
Out[41]: 'cyt_c.pca.npz'

Plot results

Let’s plot RMSDs of all conformations from the average conformation:

In [42]: rmsd = calcRMSD(ensemble)

In [43]: plot(rmsd);

In [44]: xlabel('Conformation index');

In [45]: ylabel('RMSD (A)');

Let’s show a projection of the ensemble onto PC1 and PC2:

In [46]: showProjection(ensemble, pca[:2]);