Data Collection with CATH

Searching CATH

We can also use the CATHDB object to look up a particular node of the CATH hierarchy by its CATH ID:

In [17]: node = cath.find('1.10.8')

In [18]: node.name
Out[18]: 'Helicase, Ruva Protein; domain 3'

We can then examine its children:

In [19]: node.getchildren().name
Out[19]: 
['DNA helicase RuvA subunit, C-terminal domain',
 'N-terminal domain of phosphatidylinositol transfer protein sec14p',
 'Albumin-binding domain',
 'Superfamily 1.10.8.50',
 'Superfamily 1.10.8.60',
 'Glutamate-tRNA synthetase, class I, anticodon-binding domain 1',
 'Magnesium chelatase subunit I, C-Terminal domain',
 'Ribosomal RNA adenine dimethylase-like, domain 2',
 'Photosystem I PsaF, reaction centre subunit III',
 'Superfamily 1.10.8.130',
 'PDCD5-like',
 'DNA primase S; domain 2',
 'Carbon monoxide dehydrogenase alpha subunit. Chain M, domain 1',
 'Replisome organizer (g39p helicase loader/inhibitor protein)',
 'Sirohaem synthase, dimerisation domain',
 'CofD-like domain',
 'DNA polymerase III clamp loader domain like',
 'HI0933 insert domain-like',
 'putative rabgap domain of human tbc1 domain family member 14 like domains',
 'ABC transporter ATPase domain-like',
 'uncharacterized protein sp1917 domain',
 'PG0816-like',
 'PG0816-like',
 'Bacterial muramidase',
 '3,6-anhydro-alpha-l-galactosidase',
 'nsp7 replicase',
 'Uncharacterised protein PF01937, DUF89, domain 1',
 'Internalin N-terminal Cap domain-like',
 'Enoyl acyl carrier protein reductase',
 'RecR Domain 1',
 'Helical domain of apoptotic protease-activating factors',
 'Vesicular stomatitis virus phosphoprotein C-terminal domain',
 'ppGaNTase-T1 linker domain-like',
 'Superfamily 1.10.8.470',
 'Superfamily 1.10.8.480',
 'Ced-4 linker helical domain-like',
 'HAMP domain in histidine kinase',
 'ExsD N-terminal domain-like',
 'DNA polymerase alpha-primase, subunit B, N-terminal domain',
 'FHIPEP family, domain 3',
 'Proto-chlorophyllide reductase 57 kD subunit B',
 'Antirestriction protein ArdA, domain 2',
 'Hypothetical protein YfmB',
 'Superfamily 1.10.8.590',
 'Phage phi29 replication organiser protein p16.7-like',
 'SirC, precorrin-2 dehydrogenase, C-terminal helical domain-like',
 'ORF12 helical bundle domain-like',
 'Cytochrome C biogenesis protein',
 'Uncharacterised protein PF13642 yp_926445, C-terminal domain',
 'Superfamily 1.10.8.660',
 'Ypt/Rab-GAP domain of gyp1p, domain 2',
 'Bacteriophage clamp loader A subunit, A domain',
 'Superfamily 1.10.8.710',
 'Region D6 of dynein motor',
 'Superfamily 1.10.8.730',
 'Superfamily 1.10.8.740',
 'Phosphoribosylformylglycinamidine synthase, linker domain',
 'Haem-binding uptake, Tiki superfamily, ChaN, domain 2',
 'Superfamily 1.10.8.770',
 'RNA-dependent RNA polymerase, slab domain, helical subdomain-like',
 'D-family DNA polymerase, DP1 subunit N-terminal domain',
 'Daxx helical bundle domain',
 'Ribosome-associated complex head domain',
 'Histone-lysine N methyltransferase , C-terminal domain-like',
 'Enterocin 7a-like',
 'Alpha-glycerophosphate oxidase, cap domain',
 'Birnavirus VP3 protein, domain 2',
 'Superfamily 1.10.8.890',
 'Superfamily 1.10.8.900',
 'Protein of unknown function DUF1465',
 'Uncharacterised protein, phage p2 ORF12',
 'Filoviridae VP35, C-terminal inhibitory domain, helical subdomain',
 'Superfamily 1.10.8.960',
 'Flavivirus envelope glycoprotein M-like',
 'Superfamily 1.10.8.990',
 'Ornithine 4,5 aminomutase S component, alpha subunit-like',
 'Superfamily 1.10.8.1010',
 'RecQ-mediated genome instability protein 1, N-terminal domain',
 'Superfamily 1.10.8.1040',
 'Antitoxin VbhA-like',
 'Corynebacterium glutamicum thioredoxin-dependent arsenate reductase, N-terminal domain',
 'Superfamily 1.10.8.1070',
 'Superfamily 1.10.8.1080',
 'Histone RNA hairpin-binding protein RNA-binding domain',
 'Bacterial toxin RNase RnlA/LsoA, C-terminal Dmd-binding domain',
 'Superfamily 1.10.8.1160',
 'Superfamily 1.10.8.1170',
 'Superfamily 1.10.8.1180',
 'Superfamily 1.10.8.1190',
 'Superfamily 1.10.8.1210',
 'Superfamily 1.10.8.1220',
 'Superfamily 1.10.8.1240',
 'Glutaminyl-tRNA synthetase, non-specific RNA binding region part 1, domain 1',
 'Superfamily 1.10.8.1310',
 'Superfamily 1.10.8.1320',
 'Intein homing endonuclease, domain III']
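CATH IDs are dot-separated, with one number per level of the hierarchy (Class, Architecture, Topology, Homologous superfamily). This is why the children of '1.10.8' listed above are all superfamilies of the form 1.10.8.x. The following minimal sketch works on the ID strings only; the helper names cath_level and parent_id are hypothetical and are not part of ProDy.

```python
# Level names of the CATH hierarchy, in order of increasing depth.
LEVELS = ('Class', 'Architecture', 'Topology', 'Homologous superfamily')


def cath_level(cath_id):
    """Return the name of the hierarchy level that a CATH ID refers to."""
    parts = cath_id.split('.')
    if not 1 <= len(parts) <= len(LEVELS) or not all(p.isdigit() for p in parts):
        raise ValueError('not a CATH ID of one to four numbers: %r' % cath_id)
    return LEVELS[len(parts) - 1]


def parent_id(cath_id):
    """Return the ID one level up, or None for a top-level class."""
    head, sep, _ = cath_id.rpartition('.')
    return head if sep else None


print(cath_level('1.10.8'))     # Topology
print(cath_level('1.10.8.40'))  # Homologous superfamily
print(parent_id('1.10.8.40'))   # 1.10.8
```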

Lastly, the CATHDB object can be used to find different CATH domains within a particular PDB structure:

In [20]: result = cath.search('3kg2A')

In [21]: result.name
Out[21]: 
['Superfamily 1.10.287.70',
 'Superfamily 3.40.50.2300',
 'Superfamily 3.40.50.2300',
 'Periplasmic binding protein-like II',
 'Periplasmic binding protein-like II']

In [22]: result.getSelstrs()
Out[22]: 
['resindex 500 to 621 or resindex 777 to 808',
 'resindex 104 to 239 or resindex 346 to 376',
 'resindex 1 to 103 or resindex 240 to 345',
 'resindex 384 to 489 or resindex 724 to 773',
 'resindex 490 to 499 or resindex 622 to 723']

This iGluR example also illustrates that CATH domains may not correspond to biological domains identified by other methods.

The N-terminal domain (NTD; residues 1 to 376), a type-I PBP domain, is split into two CATH domains corresponding to its two lobes, each of which belongs to ‘Superfamily 3.40.50.2300’.

Likewise, the two lobes of the ligand-binding domain (LBD) are assigned as separate domains that both belong to ‘Periplasmic binding protein-like II’, a classification that usually describes the whole bi-lobed clamshell structure.
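The split of the NTD can be checked directly from the output above. The sketch below copies the names and selection strings from Out[21] and Out[22], parses the index ranges, and merges the two ‘Superfamily 3.40.50.2300’ domains. It assumes that "A to B" ranges are inclusive, as in ProDy selection syntax; the helpers parse_ranges and covered are hypothetical and only operate on the strings.

```python
import re

# Copied from result.name and result.getSelstrs() above (same order).
names = ['Superfamily 1.10.287.70',
         'Superfamily 3.40.50.2300',
         'Superfamily 3.40.50.2300',
         'Periplasmic binding protein-like II',
         'Periplasmic binding protein-like II']
selstrs = ['resindex 500 to 621 or resindex 777 to 808',
           'resindex 104 to 239 or resindex 346 to 376',
           'resindex 1 to 103 or resindex 240 to 345',
           'resindex 384 to 489 or resindex 724 to 773',
           'resindex 490 to 499 or resindex 622 to 723']


def parse_ranges(selstr):
    """Return (start, end) pairs from a 'resindex A to B or ...' string."""
    return [(int(a), int(b))
            for a, b in re.findall(r'resindex (\d+) to (\d+)', selstr)]


def covered(selstr_list):
    """Return the set of indices covered by a list of selection strings."""
    indices = set()
    for selstr in selstr_list:
        for start, end in parse_ranges(selstr):
            indices.update(range(start, end + 1))  # inclusive end (assumed)
    return indices


ntd = covered(s for n, s in zip(names, selstrs)
              if n == 'Superfamily 3.40.50.2300')
print(min(ntd), max(ntd), len(ntd))  # 1 376 376
```

Taken together, the two CATH domains cover indices 1 to 376 without gaps, which matches the NTD range quoted above even though each lobe on its own is discontinuous in sequence.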

Getting atomic structures from CATH and building ensembles

We can also get the PDB IDs (with chain identifiers) associated with a particular node:

In [23]: node = cath.find('1.10.8.40')

In [24]: node.getPDBs()
Out[24]: 
['2j5yA',
 '2j5yB',
 '2vdbB',
 '1tf0B',
 '1gabA',
 '1prbA',
 '2n35A',
 '2jwsA',
 '2kdlA',
 '2lhcA',
 '2lhgA',
 '2mh8A',
 '2fs1A',
 '1gjsA',
 '1gjtA']

Two other useful methods retrieve the associated CATH domains and selection strings:

In [25]: node.getDomains()
Out[25]: 
[<CATHElement: 2j5yA00>,
 <CATHElement: 2j5yB00>,
 <CATHElement: 2vdbB00>,
 <CATHElement: 1tf0B00>,
 <CATHElement: 1gabA00>,
 <CATHElement: 1prbA00>,
 <CATHElement: 2n35A00>,
 <CATHElement: 2jwsA00>,
 <CATHElement: 2kdlA00>,
 <CATHElement: 2lhcA00>,
 <CATHElement: 2lhgA00>,
 <CATHElement: 2mh8A00>,
 <CATHElement: 2fs1A00>,
 <CATHElement: 1gjsA00>,
 <CATHElement: 1gjtA00>]

In [26]: node.getSelstrs()
Out[26]: 
['resindex 1 to 61',
 'resindex 1 to 61',
 'resindex 1 to 55',
 'resindex 1 to 53',
 'resindex 1 to 53',
 'resindex 1 to 53',
 'resindex 1 to 52',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 56',
 'resindex 1 to 65',
 'resindex 1 to 65']
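The outputs of getPDBs and getSelstrs appear in the same order, so they can be paired entry by entry. The sketch below uses the lists copied from the output above and splits each identifier into a four-character PDB code and a chain ID. Treating the lists as parallel is an assumption based on the listings shown here, and the records structure is purely illustrative.

```python
# Copied from node.getPDBs() and node.getSelstrs() above.
pdbs = ['2j5yA', '2j5yB', '2vdbB', '1tf0B', '1gabA', '1prbA', '2n35A',
        '2jwsA', '2kdlA', '2lhcA', '2lhgA', '2mh8A', '2fs1A', '1gjsA',
        '1gjtA']
selstrs = ['resindex 1 to 61', 'resindex 1 to 61', 'resindex 1 to 55',
           'resindex 1 to 53', 'resindex 1 to 53', 'resindex 1 to 53',
           'resindex 1 to 52', 'resindex 1 to 56', 'resindex 1 to 56',
           'resindex 1 to 56', 'resindex 1 to 56', 'resindex 1 to 56',
           'resindex 1 to 56', 'resindex 1 to 65', 'resindex 1 to 65']

# Pair each chain with its selection string (assumes matching order).
records = [{'pdb': pdb_chain[:4],     # four-character PDB code
            'chain': pdb_chain[4:],   # chain identifier
            'selstr': selstr}
           for pdb_chain, selstr in zip(pdbs, selstrs)]

# Several chains can come from the same PDB entry (here 2j5y, chains A and B).
unique_entries = sorted({record['pdb'] for record in records})
print(len(records), len(unique_entries))  # 15 14
```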

The parsePDBs method combines all of these steps, fetching and parsing the structures from the PDB and applying the appropriate selections in one call:

In [27]: proteins = node.parsePDBs(subset='ca')

In [28]: proteins
Out[28]: 
[<Selection: 'resindex 1 to 61' from 2j5yA_ca (60 atoms)>,
 <Selection: 'resindex 1 to 61' from 2j5yB_ca (60 atoms)>,
 <Selection: 'resindex 1 to 55' from 2vdbB_ca (54 atoms)>,
 <Selection: 'resindex 1 to 53' from 1tf0B_ca (52 atoms)>,
 <Selection: 'resindex 1 to 53' from 1gabA_ca (52 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 53' from 1prbA_ca (52 atoms)>,
 <Selection: 'resindex 1 to 52' from 2n35A_ca (51 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2jwsA_ca (55 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2kdlA_ca (55 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2lhcA_ca (55 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2lhgA_ca (55 atoms; active #0 of 10 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2mh8A_ca (55 atoms; active #0 of 10 coordsets)>,
 <Selection: 'resindex 1 to 56' from 2fs1A_ca (55 atoms; active #0 of 20 coordsets)>,
 <Selection: 'resindex 1 to 65' from 1gjsA_ca (64 atoms; active #0 of 30 coordsets)>,
 <Selection: 'resindex 1 to 65' from 1gjtA_ca (64 atoms)>]

We can then build a PDBEnsemble from these selections:

In [29]: ens = buildPDBEnsemble(proteins, mapping='CE')

In [30]: ens
Out[30]: <PDBEnsemble: Unknown (15 conformations; 60 atoms)>