Data Collection with PDB IDs

The first step in signature dynamics analysis is to collect a set of related protein structures and build a PDBEnsemble. This can be achieved by multiple routes: a query search of the PDB using blastPDB() or searchDali(), extraction of PDB IDs from the Pfam or CATH database, or input of a pre-defined list.

We demonstrate the usage of SignDy with a pre-defined list of transporter proteins sharing the common LeuT fold [YS13]. These proteins cycle through four typical states to transport a substrate molecule: outward-facing open (OFo), outward-facing closed (OFc), inward-facing open (IFo), and inward-facing closed (IFc). However, only the first three states have PDB structures available. If you know how to prepare an ensemble of structural homologs and wish to skip this part, you can download the ensemble file used in [SZ18] from here and proceed to the next tutorial.

First, make the necessary imports from the ProDy and Matplotlib packages if you haven’t already.

In [1]: from prody import *

In [2]: from pylab import *

In [3]: ion()

Prepare ensemble

For convenience and clarity, we define the LeuT-fold proteins in separate lists, grouped taxonomically. For example, the PDB identifiers for bacterial leucine transporters are defined as follows:

In [4]: LeuTs = ['2A65', '2Q6H', '2Q72', '2QB4', '2QEI', '2QJU', '3F3A', '3F3C',
   ...:          '3F3D', '3F3E', '3F48', '3F4I', '3F4J', '3GJC', '3GJD', '3GWU',
   ...:          '3GWV', '3GWW', '3MPN', '3MPQ', '3QS4', '3QS5', '3QS6', '3TT1',
   ...:          '3TU0', '3USG', '3USI', '3USJ', '3USK', '3USL', '3USM', '3USO',
   ...:          '3USP', '4FXZ', '4FY0', '4HMK', '4HOD', '4MM4', '4MM5', '4MM6',
   ...:          '4MM7', '4MM8', '4MM9', '4MMA', '4MMB', '4MMC', '4MMD', '4MME',
   ...:          '4MMF', '3TT3']
   ...: 

Although bacterial leucine transporters can form dimers, we take only chain A from each structure:

In [5]: LeuTs = [protID + 'A' for protID in LeuTs]

Note that in the above line, we use a list comprehension to append the letter ‘A’ to each PDB identifier in the list, which selects chain A. We define the other LeuT-fold proteins similarly:

In [6]: DATs = ['4M48', '4XNU', '4XNX', '4XP1', '4XP4', '4XP5', '4XP6',
   ...:         '4XP9C', '4XPA', '4XPB', '4XPF', '4XPG', '4XPH', '4XPT']
   ...: 

In [7]: DATs = [protID + 'A' for protID in DATs if protID != '4XP9C']

In [8]: MhsTs = ['4US4A', '4US3A']

In [9]: vSGLTs = ['2XQ2A']

In [10]: Mhp1s = ['2JLN', '2X79', '4D1A', '4D1B', '4D1C', '4D1D']

In [11]: Mhp1s = [protID + 'A' for protID in Mhp1s]

In [12]: BetPs = ['2WITA', '2WITB', '2WITC', '3P03A', '3P03B', '3P03C',
   ....:          '4AINA', '4AINB', '4AINC', '4C7RA', '4C7RB', '4C7RC',
   ....:          '4DOJA', '4DOJB', '4DOJC', '4LLHA', '4LLHB', '4LLHC']
   ....: 

In [13]: AdiCs = ['3L1L', '3LRB', '3LRC', '3NCY', '3OB6', '5J4I', '5J4N']

In [14]: AdiCs = [protID + 'A' for protID in AdiCs]

In [15]: CaiTs = ['4M8JA', '2WSXA', '2WSXB', '2WSXC', '2WSWA', '3HFXA']
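
The lists above mix bare four-character PDB IDs with entries that already carry a chain letter. As a minimal sketch, a small helper (the name with_chain is hypothetical, not part of ProDy) could append a default chain only to bare IDs. Unlike the filter used for DATs above, which leaves the pre-labeled entry out of the list, this keeps such entries unchanged:

```python
def with_chain(identifiers, default_chain='A'):
    """Append default_chain to bare 4-character PDB IDs.

    Entries longer than 4 characters are assumed (for this sketch) to
    already include a chain letter and are returned unchanged.
    """
    result = []
    for ident in identifiers:
        if len(ident) == 4:
            result.append(ident + default_chain)
        else:
            result.append(ident)
    return result


sample = ['4M48', '4XP9C', '4XPA']
print(with_chain(sample))  # ['4M48A', '4XP9C', '4XPAA']
```

Whether to keep or drop a pre-labeled entry depends on the analysis; the helper only makes the choice explicit.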

parsePDB() can parse multiple structures in a single call, so we can load all the PDB structures into ProDy in one line. We only need the alpha carbons for our purpose, which we select later by passing subset='ca' to buildPDBEnsemble():

In [16]: pdb_ids = LeuTs + DATs + MhsTs + vSGLTs + Mhp1s + BetPs + AdiCs + CaiTs

In [17]: ags = parsePDB(pdb_ids)

In [18]: len(ags)
Out[18]: 103
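
When concatenating hand-written lists like these, it can help to check the identifiers for typos and repeats. The following is a minimal sketch in plain Python, not a ProDy function; the helper name check_identifiers and the assumed format (a four-character PDB ID followed by one chain character) are illustrative assumptions:

```python
import re
from collections import Counter

# Assumed format for this tutorial: 4-character PDB ID (starting with a
# digit) followed by exactly one chain character.
ID_WITH_CHAIN = re.compile(r'[0-9][A-Za-z0-9]{3}[A-Za-z0-9]')


def check_identifiers(identifiers):
    """Return (malformed, duplicates) for a list of ID+chain strings."""
    malformed = [i for i in identifiers if not ID_WITH_CHAIN.fullmatch(i)]
    counts = Counter(i.upper() for i in identifiers)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return malformed, duplicates


sample = ['2A65A', '2Q6HA', '4XP9', '2A65A']
malformed, duplicates = check_identifiers(sample)
print(malformed)   # ['4XP9']
print(duplicates)  # ['2A65A']
```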

Each element of the list ags should be an AtomGroup instance. We can conveniently feed this list to buildPDBEnsemble() and let it build a PDBEnsemble for downstream analyses. We set mapping='ce' to tell the function to use a structural alignment algorithm, CEalign [IS98], for building the ensemble. We also set seqid=0 and overlap=0 so that no sequence identity or coverage/overlap threshold is applied during the building process.

In [19]: ens = buildPDBEnsemble(ags, mapping='ce', seqid=0, overlap=0, title='LeuT', subset='ca')

In [20]: ens
Out[20]: <PDBEnsemble: LeuT (103 conformations; 510 atoms)>
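
To illustrate what sequence identity and overlap thresholds do in general, here is a small sketch in plain Python. It is not ProDy's implementation: the function names are hypothetical, and the definitions used (identity over aligned residue pairs, overlap as the fraction of reference residues that are paired, both in percent) are assumptions that may differ from the library's. With both thresholds at 0, every mapping is accepted, which is the effect sought above for distantly related LeuT-fold proteins.

```python
def identity_and_overlap(aligned_target, aligned_reference):
    """Return (percent identity, percent overlap) for two gapped strings.

    Assumed definitions for this sketch:
      identity = identical pairs / aligned (non-gap) pairs
      overlap  = aligned pairs / residues in the reference
    """
    if len(aligned_target) != len(aligned_reference):
        raise ValueError('aligned sequences must have equal length')
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_reference)
             if a != '-' and b != '-']
    ref_len = sum(1 for b in aligned_reference if b != '-')
    if not pairs or ref_len == 0:
        return 0.0, 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs), 100.0 * len(pairs) / ref_len


def passes(aligned_target, aligned_reference, seqid=0, overlap=0):
    """Accept the mapping only if both thresholds are met."""
    s, o = identity_and_overlap(aligned_target, aligned_reference)
    return s >= seqid and o >= overlap


print(identity_and_overlap('MK-LV', 'MRALV'))  # (75.0, 80.0)
print(passes('MK-LV', 'MRALV'))                # True: no thresholds applied
print(passes('MK-LV', 'MRALV', seqid=90))      # False: identity too low
```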

Finally, we save the ensemble for later processing:

In [21]: saveEnsemble(ens, 'LeuT')
Out[21]: 'LeuT.ens.npz'

A more refined alignment procedure was adopted in [SZ18]. A representative structure was chosen from each subtype of the proteins (e.g. LeuT, DAT), and these representatives were aligned to the LeuT representative using CEalign [IS98]. The remaining structures were then aligned to the representative of their own subtype using the pairwise alignment algorithm, because members of a subtype have essentially the same sequence apart from small differences.
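
The two-stage procedure described above can be sketched as a simple plan of (mobile, target, method) steps. This is only an illustration in plain Python, not the code used in [SZ18]: the function name alignment_plan is hypothetical, each family list is assumed to be non-empty, and taking the first entry of each list as the representative is an arbitrary assumption.

```python
def alignment_plan(families, reference_family):
    """Build a list of (mobile, target, method) alignment steps.

    families: dict mapping a family name to a non-empty list of identifiers.
    Assumption for this sketch: the first identifier in each list is the
    family representative.
    """
    reps = {name: members[0] for name, members in families.items()}
    ref = reps[reference_family]
    plan = []
    for name, members in families.items():
        rep = reps[name]
        if rep != ref:
            # Stage 1: representative onto the reference, structure-based
            plan.append((rep, ref, 'ce'))
        for member in members[1:]:
            # Stage 2: remaining members onto their own representative
            plan.append((member, rep, 'pairwise'))
    return plan


families = {
    'LeuT': ['2A65A', '2Q6HA'],
    'MhsT': ['4US4A', '4US3A'],
    'vSGLT': ['2XQ2A'],
}
for step in alignment_plan(families, 'LeuT'):
    print(step)
```

Each step names which structure would be mapped onto which target and by which kind of alignment; carrying out the alignments themselves is left to the tools described above.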

[YS13] Shi Y. Common folds and transport mechanisms of secondary active transporters. Annu. Rev. Biophys. 2013 42:51-72.