Data Collection with CATH

Navigating the CATH tree

The first step in signature dynamics analysis is to collect a set of related protein structures and build a PDBEnsemble. This can be achieved by multiple routes: a query search of the PDB using blastPDB() or searchDali(), extraction of PDB IDs from the Pfam or CATH database, or input of a pre-defined list.

Here, we demonstrate the usage of CATH for ensemble building.

First, make necessary imports from ProDy, NumPy and Matplotlib packages if you haven’t already.

In [1]: from prody import *

In [2]: from pylab import *

In [3]: ion()

Next, we initialise a CATHDB object. By default, this downloads data from the CATH website.

In [4]: cath = CATHDB()

In [5]: cath
Out[5]: <CATHDB: 5 members>

We can also use this object to save this data to an .xml file and load it later:

In [6]: cath.save('cath.xml')

In [7]: cath = CATHDB('cath.xml')
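
The save-and-reload idea can be illustrated without ProDy. The following is a minimal sketch using only the standard library's xml.etree.ElementTree on a tiny hand-written tree; the file name mini_cath.xml and the node/cath/name layout are assumptions made for illustration and are not claimed to match the file that CATHDB.save() writes.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical miniature tree; the real CATH data is far larger and
# its XML layout may differ from the one assumed here.
root = ET.Element('root')
ET.SubElement(root, 'node', {'cath': '1', 'name': 'Mainly Alpha'})
ET.SubElement(root, 'node', {'cath': '2', 'name': 'Mainly Beta'})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'mini_cath.xml')
    ET.ElementTree(root).write(path)       # save the tree to disk
    reloaded = ET.parse(path).getroot()    # load it back from the file

# The reloaded tree carries the same children and attributes.
reloaded_names = [child.get('name') for child in reloaded]
print(reloaded_names)
```

Keeping a local copy in this way avoids repeating the download each time the tree is needed.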

One way of using the CATHDB class is to navigate the CATH tree, using modified versions of methods and properties inherited from base classes in xml.etree.ElementTree.

The root of the tree and all other elements in it are instances of the CATHElement class, which is based on Element, allowing us to easily navigate the CATH tree structure using parent/child relationships as follows:

In [8]: root = cath.root

In [9]: root
Out[9]: <CATHElement: root (5 members)>

In [10]: node = root.getchildren()

In [11]: node
Out[11]: 
<CATHCollection:
[<CATHElement: 1 (5 members)>
<CATHElement: 2 (21 members)>
<CATHElement: 3 (14 members)>
<CATHElement: 4 (1 members)>
<CATHElement: 6 (2 members)>]>
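
For comparison, the sketch below shows parent/child navigation with the plain standard-library Element class on a small hand-written tree. It is not ProDy code: the XML layout is an assumption, and the plain Element exposes children through iteration (list(element)) and has no parent pointer, so a parent map is built explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature hierarchy, written by hand for illustration.
xml_text = """
<root>
  <node cath="1" name="Mainly Alpha">
    <node cath="1.10" name="Orthogonal Bundle"/>
    <node cath="1.20" name="Up-down Bundle"/>
  </node>
  <node cath="2" name="Mainly Beta"/>
</root>
""".strip()

root = ET.fromstring(xml_text)

# Children of an element are obtained by iterating over it.
children = list(root)
child_ids = [child.get('cath') for child in children]

# The plain Element keeps no reference to its parent, so we build
# a child -> parent lookup by walking the whole tree once.
parent_of = {child: parent for parent in root.iter() for child in parent}

first = children[0]      # the '1' node
grandchild = first[1]    # the '1.20' node
print(child_ids, parent_of[grandchild].get('name'))
```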

Any branching point node containing a collection of children is an instance of the CATHCollection class, which is based on the CATHElement class but has additional and modified properties and methods.

For example, collections return a list of values for the properties cath (CATH ID) and name, while elements return single values:

In [12]: node.name
Out[12]: 
['Mainly Alpha',
 'Mainly Beta',
 'Alpha Beta',
 'Few Secondary Structures',
 'Special']

In [13]: node.cath
Out[13]: ['1', '2', '3', '4', '6']

In [14]: element = node[0]

In [15]: element.name
Out[15]: 'Mainly Alpha'

In [16]: element.cath
Out[16]: '1'
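
The difference between element and collection properties can be mimicked with two small classes. This is only a sketch of the pattern, with hypothetical class names Node and NodeCollection; it makes no claim about how CATHElement and CATHCollection are implemented.

```python
class Node:
    """Hypothetical stand-in for a single tree element."""

    def __init__(self, cath, name):
        self.cath = cath
        self.name = name


class NodeCollection:
    """Hypothetical stand-in for a collection of elements."""

    def __init__(self, nodes):
        self._nodes = list(nodes)

    def __getitem__(self, index):
        # Indexing returns a single element.
        return self._nodes[index]

    def __len__(self):
        return len(self._nodes)

    @property
    def name(self):
        # A collection reports one value per member.
        return [node.name for node in self._nodes]

    @property
    def cath(self):
        return [node.cath for node in self._nodes]


classes = NodeCollection([Node('1', 'Mainly Alpha'), Node('2', 'Mainly Beta')])
print(classes.name)      # list of names, one per member
print(classes[0].name)   # single name of the first member
```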

Searching CATH

We can also use the CATHDB class to find a particular part of the CATH hierarchy by CATH ID:

In [17]: node = cath.find('1.10.8')

In [18]: node.name
Out[18]: 'Helicase, Ruva Protein; domain 3'

We can then examine its children:

In [19]: node.getchildren().name
Out[19]: 
['DNA helicase RuvA subunit, C-terminal domain',
 'N-terminal domain of phosphatidylinositol transfer protein sec14p',
 'Albumin-binding domain',
 'Superfamily 1.10.8.50',
 'Superfamily 1.10.8.60',
 'Glutamate-tRNA synthetase, class I, anticodon-binding domain 1',
 'Magnesium chelatase subunit I, C-Terminal domain',
 'Ribosomal RNA adenine dimethylase-like, domain 2',
 'Photosystem I PsaF, reaction centre subunit III',
 'Superfamily 1.10.8.130',
 'PDCD5-like',
 'DNA primase S; domain 2',
 'Carbon monoxide dehydrogenase alpha subunit. Chain M, domain 1',
 'Replisome organizer (g39p helicase loader/inhibitor protein)',
 'Sirohaem synthase, dimerisation domain',
 'CofD-like domain',
 'DNA polymerase III clamp loader domain like',
 'HI0933 insert domain-like',
 'putative rabgap domain of human tbc1 domain family member 14 like domains',
 'ABC transporter ATPase domain-like',
 'uncharacterized protein sp1917 domain',
 'PG0816-like',
 'PG0816-like',
 'Bacterial muramidase',
 '3,6-anhydro-alpha-l-galactosidase',
 'nsp7 replicase',
 'Uncharacterised protein PF01937, DUF89, domain 1',
 'Internalin N-terminal Cap domain-like',
 'Enoyl acyl carrier protein reductase',
 'RecR Domain 1',
 'Helical domain of apoptotic protease-activating factors',
 'Vesicular stomatitis virus phosphoprotein C-terminal domain',
 'ppGaNTase-T1 linker domain-like',
 'Superfamily 1.10.8.470',
 'Superfamily 1.10.8.480',
 'Ced-4 linker helical domain-like',
 'HAMP domain in histidine kinase',
 'ExsD N-terminal domain-like',
 'DNA polymerase alpha-primase, subunit B, N-terminal domain',
 'FHIPEP family, domain 3',
 'Proto-chlorophyllide reductase 57 kD subunit B',
 'Antirestriction protein ArdA, domain 2',
 'Hypothetical protein YfmB',
 'Superfamily 1.10.8.590',
 'Phage phi29 replication organiser protein p16.7-like',
 'SirC, precorrin-2 dehydrogenase, C-terminal helical domain-like',
 'ORF12 helical bundle domain-like',
 'Cytochrome C biogenesis protein',
 'Uncharacterised protein PF13642 yp_926445, C-terminal domain',
 'Superfamily 1.10.8.660',
 'Ypt/Rab-GAP domain of gyp1p, domain 2',
 'Bacteriophage clamp loader A subunit, A domain',
 'Superfamily 1.10.8.710',
 'Region D6 of dynein motor',
 'Superfamily 1.10.8.730',
 'Superfamily 1.10.8.740',
 'Phosphoribosylformylglycinamidine synthase, linker domain',
 'Haem-binding uptake, Tiki superfamily, ChaN, domain 2',
 'Superfamily 1.10.8.770',
 'RNA-dependent RNA polymerase, slab domain, helical subdomain-like',
 'D-family DNA polymerase, DP1 subunit N-terminal domain',
 'Daxx helical bundle domain',
 'Ribosome-associated complex head domain',
 'Histone-lysine N methyltransferase , C-terminal domain-like',
 'Enterocin 7a-like',
 'Alpha-glycerophosphate oxidase, cap domain',
 'Birnavirus VP3 protein, domain 2',
 'Superfamily 1.10.8.890',
 'Superfamily 1.10.8.900',
 'Protein of unknown function DUF1465',
 'Uncharacterised protein, phage p2 ORF12',
 'Filoviridae VP35, C-terminal inhibitory domain, helical subdomain',
 'Superfamily 1.10.8.960',
 'Flavivirus envelope glycoprotein M-like',
 'Superfamily 1.10.8.990',
 'Ornithine 4,5 aminomutase S component, alpha subunit-like',
 'Superfamily 1.10.8.1010',
 'RecQ-mediated genome instability protein 1, N-terminal domain',
 'Superfamily 1.10.8.1040',
 'Antitoxin VbhA-like',
 'Corynebacterium glutamicum thioredoxin-dependent arsenate reductase, N-terminal domain',
 'Superfamily 1.10.8.1070',
 'Superfamily 1.10.8.1080',
 'Histone RNA hairpin-binding protein RNA-binding domain',
 'Bacterial toxin RNase RnlA/LsoA, C-terminal Dmd-binding domain',
 'Superfamily 1.10.8.1160',
 'Superfamily 1.10.8.1170',
 'Superfamily 1.10.8.1180',
 'Superfamily 1.10.8.1190',
 'Superfamily 1.10.8.1210',
 'Superfamily 1.10.8.1220',
 'Superfamily 1.10.8.1240',
 'Glutaminyl-tRNA synthetase, non-specific RNA binding region part 1, domain 1',
 'Superfamily 1.10.8.1310',
 'Superfamily 1.10.8.1320',
 'Intein homing endonuclease, domain III']
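
A dotted CATH ID encodes the path from the root, so a lookup can descend one level per component. The sketch below illustrates this on a hand-written nested dictionary; the helper name find_by_id and the dictionary layout are assumptions for illustration, and the code does not describe how CATHDB.find() works internally.

```python
# Hypothetical miniature hierarchy keyed by CATH ID at each level.
tree = {
    'name': 'root',
    'children': {
        '1': {
            'name': 'Mainly Alpha',
            'children': {
                '1.10': {
                    'name': 'Orthogonal Bundle',
                    'children': {
                        '1.10.8': {
                            'name': 'Helicase, Ruva Protein; domain 3',
                            'children': {},
                        },
                    },
                },
            },
        },
    },
}


def find_by_id(tree, cath_id):
    """Descend the tree one dotted component at a time.

    Returns the matching node, or None if any level is missing.
    """
    parts = cath_id.split('.')
    node = tree
    for depth in range(1, len(parts) + 1):
        key = '.'.join(parts[:depth])   # '1', then '1.10', then '1.10.8'
        node = node['children'].get(key)
        if node is None:
            return None
    return node


hit = find_by_id(tree, '1.10.8')
miss = find_by_id(tree, '1.20')
print(hit['name'], miss)
```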

Lastly, the CATHDB object can be used to find different CATH domains within a particular PDB structure:

In [20]: result = cath.search('3kg2A')

In [21]: result.name
Out[21]: 
['Superfamily 1.10.287.70',
 'Superfamily 3.40.50.2300',
 'Superfamily 3.40.50.2300',
 'Periplasmic binding protein-like II',
 'Periplasmic binding protein-like II']

In [22]: result.getSelstrs()
Out[22]: 
['resindex 500 to 621 or resindex 777 to 808',
 'resindex 104 to 239 or resindex 346 to 376',
 'resindex 1 to 103 or resindex 240 to 345',
 'resindex 384 to 489 or resindex 724 to 773',
 'resindex 490 to 499 or resindex 622 to 723']

This iGluR example also illustrates that CATH domains may not correspond to biological domains identified by other methods.

The N-terminal domain (NTD; residues 1 to 376), a type-I PBP domain, is split into two CATH domains corresponding to its two lobes, each of which belongs to ‘Superfamily 3.40.50.2300’.

Likewise, the two lobes of the ligand-binding domain (LBD) are assigned as separate domains that both belong to ‘Periplasmic binding protein-like II’, a classification that usually covers the whole bi-lobed clamshell structure.
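
The NTD split can be checked directly from the selection strings shown above. The sketch below parses the two ‘Superfamily 3.40.50.2300’ strings into sets of integers and confirms that together they cover 1 to 376 without overlap. The helper selstr_to_indices is hypothetical; it assumes that both endpoints of each "M to N" range are inclusive and that only the "resindex M to N" form occurs.

```python
import re


def selstr_to_indices(selstr):
    """Expand 'resindex M to N or resindex P to Q' into a set of integers.

    Assumes inclusive endpoints and no other selection keywords.
    """
    indices = set()
    for start, stop in re.findall(r'resindex (\d+) to (\d+)', selstr):
        indices.update(range(int(start), int(stop) + 1))
    return indices


# The two CATH domains assigned to 'Superfamily 3.40.50.2300' above.
lobe_a = selstr_to_indices('resindex 104 to 239 or resindex 346 to 376')
lobe_b = selstr_to_indices('resindex 1 to 103 or resindex 240 to 345')

ntd = lobe_a | lobe_b
print(min(ntd), max(ntd), len(ntd), len(lobe_a & lobe_b))
```

Each lobe is made of two discontinuous segments, which is why a single CATH domain needs an "or" in its selection string.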

Getting atomic structures from CATH and building ensembles

We can also get the PDB IDs (with chain identifiers) associated with a particular level of the hierarchy:

In [23]: node = cath.find('1.10.8.40')

In [24]: node.getPDBs()
Out[24]: 
['2j5yA',
 '2j5yB',
 '2vdbB',
 '1tf0B',
 '1gabA',
 '1prbA',
 '2n35A',
 '2jwsA',
 '2kdlA',
 '2lhcA',
 '2lhgA',
 '2mh8A',
 '2fs1A',
 '1gjsA',
 '1gjtA']

Two other useful methods retrieve the associated CATH domains and selection strings.

In [25]: node.getDomains()
Out[25]: 
[<CATHElement: 2j5yA00>,
 <CATHElement: 2j5yB00>,
 <CATHElement: 2vdbB00>,
 <CATHElement: 1tf0B00>,
 <CATHElement: 1gabA00>,
 <CATHElement: 1prbA00>,
 <CATHElement: 2n35A00>,
 <CATHElement: 2jwsA00>,
 <CATHElement: 2kdlA00>,
 <CATHElement: 2lhcA00>,
 <CATHElement: 2lhgA00>,
 <CATHElement: 2mh8A00>,
 <CATHElement: 2fs1A00>,
 <CATHElement: 1gjsA00>,
 <CATHElement: 1gjtA00>]
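
The domain names above appear to follow a fixed pattern: a four-character PDB ID, a one-character chain identifier, and a two-digit domain number. The sketch below splits a name on that assumption; the helper split_domain_id is hypothetical and is not part of ProDy.

```python
def split_domain_id(domain_id):
    """Split a seven-character domain name such as '2j5yA00'.

    Assumes the layout PDB ID (4) + chain (1) + domain number (2).
    """
    if len(domain_id) != 7:
        raise ValueError('expected a seven-character domain name')
    return domain_id[:4], domain_id[4], domain_id[5:]


pdb_id, chain, number = split_domain_id('2j5yA00')
print(pdb_id, chain, number)
```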

In [26]: node.getSelstrs()
Out[26]: 
['resindex 1 to 61',
 'resindex 1 to 61',
 'resindex 1 to 55',
 'resindex 1 to 53',
 'resindex 1 to 53',
 'resindex 1 to 53',
 'resindex 1 to 52',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 65',
 'resindex 1 to 65']
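
Judging from the outputs above, the PDB ID list and the selection string list are in the same order, so they can be paired item by item. The sketch below does this for three of the entries using plain Python; the dictionary keys are chosen for illustration only.

```python
# Three entries copied from the outputs above, in matching positions.
pdb_chains = ['2j5yA', '2vdbB', '1gabA']
selstrs = ['resindex 1 to 61', 'resindex 1 to 55', 'resindex 1 to 53']

# Pair each PDB ID and chain with the selection string for its domain.
jobs = [
    {'pdb': pdb_chain[:4], 'chain': pdb_chain[4:], 'selstr': selstr}
    for pdb_chain, selstr in zip(pdb_chains, selstrs)
]

for job in jobs:
    print(job['pdb'], job['chain'], job['selstr'])
```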

We can combine all of these together to fetch and parse structures from the PDB and make the appropriate selections at the same time:

In [27]: proteins = node.parsePDBs(subset='ca')

In [28]: proteins
Out[28]: 
[<Selection: 'resindex 1 to 61' from 2j5yA_ca (60 atoms)>,
 <Selection: 'resindex 1 to 61' from 2j5yB_ca (60 atoms)>,
 <Selection: 'resindex 1 to 55' from 2vdbB_ca (54 atoms)>,
 <Selection: 'resindex 1 to 53' from 1tf0B_ca (52 atoms)>,
 <Selection: 'resindex 1 to 53' from 1gabA_ca (52 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 53' from 1prbA_ca (52 atoms)>,
 <Selection: 'resindex 1 to 52' from 2n35A_ca (51 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2jwsA_ca (55 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2kdlA_ca (55 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2lhcA_ca (55 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2lhgA_ca (55 atoms; active #0 of 10 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2mh8A_ca (55 atoms; active #0 of 10 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2fs1A_ca (55 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 65' from 1gjsA_ca (64 atoms; active #0 of 30 coordsets)>,
 <Selection: 'resindex 1 to 65' from 1gjtA_ca (64 atoms)>]

This then allows us to build a PDBEnsemble from them:

In [29]: ens = buildPDBEnsemble(proteins, mapping='CE')

In [30]: ens
Out[30]: <PDBEnsemble: Unknown (15 conformations; 60 atoms)>