Data Collection with CATH
Searching CATH
We can also use the CATHDB class to find a particular part of the CATH hierarchy by CATH ID:
In [17]: node = cath.find('1.10.8')
In [18]: node.name
Out[18]: 'Helicase, Ruva Protein; domain 3'
We can then examine its children:
In [19]: node.getchildren().name
Out[19]:
['DNA helicase RuvA subunit, C-terminal domain',
'N-terminal domain of phosphatidylinositol transfer protein sec14p',
'Albumin-binding domain',
'Superfamily 1.10.8.50',
'Superfamily 1.10.8.60',
'Glutamate-tRNA synthetase, class I, anticodon-binding domain 1',
'Magnesium chelatase subunit I, C-Terminal domain',
'Ribosomal RNA adenine dimethylase-like, domain 2',
'Photosystem I PsaF, reaction centre subunit III',
'Superfamily 1.10.8.130',
'PDCD5-like',
'DNA primase S; domain 2',
'Carbon monoxide dehydrogenase alpha subunit. Chain M, domain 1',
'Replisome organizer (g39p helicase loader/inhibitor protein)',
'Sirohaem synthase, dimerisation domain',
'CofD-like domain',
'DNA polymerase III clamp loader domain like',
'HI0933 insert domain-like',
'putative rabgap domain of human tbc1 domain family member 14 like domains',
'ABC transporter ATPase domain-like',
'uncharacterized protein sp1917 domain',
'PG0816-like',
'PG0816-like',
'Bacterial muramidase',
'3,6-anhydro-alpha-l-galactosidase',
'nsp7 replicase',
'Uncharacterised protein PF01937, DUF89, domain 1',
'Internalin N-terminal Cap domain-like',
'Enoyl acyl carrier protein reductase',
'RecR Domain 1',
'Helical domain of apoptotic protease-activating factors',
'Vesicular stomatitis virus phosphoprotein C-terminal domain',
'ppGaNTase-T1 linker domain-like',
'Superfamily 1.10.8.470',
'Superfamily 1.10.8.480',
'Ced-4 linker helical domain-like',
'HAMP domain in histidine kinase',
'ExsD N-terminal domain-like',
'DNA polymerase alpha-primase, subunit B, N-terminal domain',
'FHIPEP family, domain 3',
'Proto-chlorophyllide reductase 57 kD subunit B',
'Antirestriction protein ArdA, domain 2',
'Hypothetical protein YfmB',
'Superfamily 1.10.8.590',
'Phage phi29 replication organiser protein p16.7-like',
'SirC, precorrin-2 dehydrogenase, C-terminal helical domain-like',
'ORF12 helical bundle domain-like',
'Cytochrome C biogenesis protein',
'Uncharacterised protein PF13642 yp_926445, C-terminal domain',
'Superfamily 1.10.8.660',
'Ypt/Rab-GAP domain of gyp1p, domain 2',
'Bacteriophage clamp loader A subunit, A domain',
'Superfamily 1.10.8.710',
'Region D6 of dynein motor',
'Superfamily 1.10.8.730',
'Superfamily 1.10.8.740',
'Phosphoribosylformylglycinamidine synthase, linker domain',
'Haem-binding uptake, Tiki superfamily, ChaN, domain 2',
'Superfamily 1.10.8.770',
'RNA-dependent RNA polymerase, slab domain, helical subdomain-like',
'D-family DNA polymerase, DP1 subunit N-terminal domain',
'Daxx helical bundle domain',
'Ribosome-associated complex head domain',
'Histone-lysine N methyltransferase , C-terminal domain-like',
'Enterocin 7a-like',
'Alpha-glycerophosphate oxidase, cap domain',
'Birnavirus VP3 protein, domain 2',
'Superfamily 1.10.8.890',
'Superfamily 1.10.8.900',
'Protein of unknown function DUF1465',
'Uncharacterised protein, phage p2 ORF12',
'Filoviridae VP35, C-terminal inhibitory domain, helical subdomain',
'Superfamily 1.10.8.960',
'Flavivirus envelope glycoprotein M-like',
'Superfamily 1.10.8.990',
'Ornithine 4,5 aminomutase S component, alpha subunit-like',
'Superfamily 1.10.8.1010',
'RecQ-mediated genome instability protein 1, N-terminal domain',
'Superfamily 1.10.8.1040',
'Antitoxin VbhA-like',
'Corynebacterium glutamicum thioredoxin-dependent arsenate reductase, N-terminal domain',
'Superfamily 1.10.8.1070',
'Superfamily 1.10.8.1080',
'Histone RNA hairpin-binding protein RNA-binding domain',
'Bacterial toxin RNase RnlA/LsoA, C-terminal Dmd-binding domain',
'Superfamily 1.10.8.1160',
'Superfamily 1.10.8.1170',
'Superfamily 1.10.8.1180',
'Superfamily 1.10.8.1190',
'Superfamily 1.10.8.1210',
'Superfamily 1.10.8.1220',
'Superfamily 1.10.8.1240',
'Glutaminyl-tRNA synthetase, non-specific RNA binding region part 1, domain 1',
'Superfamily 1.10.8.1310',
'Superfamily 1.10.8.1320',
'Intein homing endonuclease, domain III']
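The IDs used above follow the dot-separated C.A.T.H scheme (Class, Architecture, Topology, Homologous superfamily), so '1.10.8' is a topology-level node and its children, such as 'Superfamily 1.10.8.50', are homologous superfamilies. The following is a minimal sketch of how such an ID can be read; `describe_cath_id` is a hypothetical helper written for illustration and is not part of ProDy.

```python
# Level names for the four dot-separated fields of a CATH ID.
CATH_LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def describe_cath_id(cath_id):
    """Return (level name, ancestor IDs) for a CATH ID such as '1.10.8'.

    Hypothetical helper for illustration; it is not part of ProDy.
    """
    fields = cath_id.split(".")
    if not 1 <= len(fields) <= len(CATH_LEVELS) or not all(f.isdigit() for f in fields):
        raise ValueError("not a C.A.T.H-style ID: %r" % cath_id)
    level = CATH_LEVELS[len(fields) - 1]
    # Ancestors are the successively shorter prefixes: '1', '1.10', ...
    ancestors = [".".join(fields[:i]) for i in range(1, len(fields))]
    return level, ancestors


print(describe_cath_id("1.10.8"))     # ('Topology', ['1', '1.10'])
print(describe_cath_id("1.10.8.40"))  # ('Homologous superfamily', ['1', '1.10', '1.10.8'])
```

Any of the ancestor IDs could in principle be passed to cath.find in the same way as '1.10.8' above.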
Lastly, the CATHDB object can be used to find the different CATH domains within a particular PDB structure:
In [20]: result = cath.search('3kg2A')
In [21]: result.name
Out[21]:
['Superfamily 1.10.287.70',
'Superfamily 3.40.50.2300',
'Superfamily 3.40.50.2300',
'Periplasmic binding protein-like II',
'Periplasmic binding protein-like II']
In [22]: result.getSelstrs()
Out[22]:
['resindex 500 to 621 or resindex 777 to 808',
'resindex 104 to 239 or resindex 346 to 376',
'resindex 1 to 103 or resindex 240 to 345',
'resindex 384 to 489 or resindex 724 to 773',
'resindex 490 to 499 or resindex 622 to 723']
This iGluR example also illustrates that CATH domains may not correspond to biological domains identified by other methods.
The N-terminal domain (NTD; residues 1 to 376), a type-I PBP domain, is split into CATH domains corresponding to the two lobes, which each belong to ‘Superfamily 3.40.50.2300’.
Likewise, the two lobes of the ligand-binding domain (LBD) are assigned as separate domains that both belong to ‘Periplasmic binding protein-like II’, a classification that usually describes the whole bi-lobed clamshell structure.
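The split of the NTD can be checked directly from the selection strings. The sketch below uses only the standard library and the strings copied from the output above; the helper names are hypothetical, and it assumes that each 'A to B' range includes both ends. Under that assumption, the two 'Superfamily 3.40.50.2300' domains are each discontinuous in sequence, and together they cover positions 1 to 376 without gaps.

```python
import re

# Selection strings copied from the result.getSelstrs() output above.
selstrs = [
    'resindex 500 to 621 or resindex 777 to 808',
    'resindex 104 to 239 or resindex 346 to 376',
    'resindex 1 to 103 or resindex 240 to 345',
    'resindex 384 to 489 or resindex 724 to 773',
    'resindex 490 to 499 or resindex 622 to 723',
]


def parse_resindex_ranges(selstr):
    """Turn 'resindex A to B or resindex C to D' into [(A, B), (C, D)]."""
    pairs = re.findall(r"resindex (\d+) to (\d+)", selstr)
    return [(int(start), int(stop)) for start, stop in pairs]


def covered_indices(selstr_list):
    """Collect every index covered by any range in the given strings."""
    covered = set()
    for selstr in selstr_list:
        for start, stop in parse_resindex_ranges(selstr):
            # Assumption: both ends of each range are included.
            covered.update(range(start, stop + 1))
    return covered


# Entries 1 and 2 are the two 'Superfamily 3.40.50.2300' domains (the NTD lobes).
ntd = covered_indices(selstrs[1:3])
print(parse_resindex_ranges(selstrs[1]))  # [(104, 239), (346, 376)]
print(min(ntd), max(ntd), len(ntd))       # 1 376 376
```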
Getting atomic structures from CATH and building ensembles
We can also get PDB IDs associated with particular levels:
In [23]: node = cath.find('1.10.8.40')
In [24]: node.getPDBs()
Out[24]:
['2j5yA',
'2j5yB',
'2vdbB',
'1tf0B',
'1gabA',
'1prbA',
'2n35A',
'2jwsA',
'2kdlA',
'2lhcA',
'2lhgA',
'2mh8A',
'2fs1A',
'1gjsA',
'1gjtA']
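Several of these identifiers come from the same PDB entry, which is worth knowing before downloading structures. As a small sketch, assuming that the first four characters of each string are the PDB ID and the remainder is the chain ID, the list above can be grouped by entry with plain Python; `group_chains` is a hypothetical helper, not a ProDy function.

```python
# Identifiers copied from the node.getPDBs() output above.
pdb_chain_ids = [
    '2j5yA', '2j5yB', '2vdbB', '1tf0B', '1gabA',
    '1prbA', '2n35A', '2jwsA', '2kdlA', '2lhcA',
    '2lhgA', '2mh8A', '2fs1A', '1gjsA', '1gjtA',
]


def group_chains(ids):
    """Map each PDB ID to the list of chain IDs listed for it.

    Assumes a 4-character PDB ID followed by the chain ID.
    """
    grouped = {}
    for item in ids:
        pdb_id, chain = item[:4], item[4:]
        grouped.setdefault(pdb_id, []).append(chain)
    return grouped


chains_by_entry = group_chains(pdb_chain_ids)
print(len(chains_by_entry))     # 14 distinct PDB entries
print(chains_by_entry['2j5y'])  # ['A', 'B']
```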
Two other useful methods retrieve the associated CATH domains and selection strings:
In [25]: node.getDomains()
Out[25]:
[<CATHElement: 2j5yA00>,
<CATHElement: 2j5yB00>,
<CATHElement: 2vdbB00>,
<CATHElement: 1tf0B00>,
<CATHElement: 1gabA00>,
<CATHElement: 1prbA00>,
<CATHElement: 2n35A00>,
<CATHElement: 2jwsA00>,
<CATHElement: 2kdlA00>,
<CATHElement: 2lhcA00>,
<CATHElement: 2lhgA00>,
<CATHElement: 2mh8A00>,
<CATHElement: 2fs1A00>,
<CATHElement: 1gjsA00>,
<CATHElement: 1gjtA00>]
In [26]: node.getSelstrs()
Out[26]:
['resindex 1 to 61',
'resindex 1 to 61',
'resindex 1 to 55',
'resindex 1 to 53',
'resindex 1 to 53',
'resindex 1 to 53',
'resindex 1 to 52',
'resindex 1 to 56',
'resindex 1 to 56',
'resindex 1 to 56',
'resindex 1 to 56',
'resindex 1 to 56',
'resindex 1 to 56',
'resindex 1 to 65',
'resindex 1 to 65']
We can combine all of these together to fetch and parse structures from the PDB and make the appropriate selections at the same time:
In [27]: proteins = node.parsePDBs(subset='ca')
In [28]: proteins
Out[28]:
[<Selection: 'resindex 1 to 61' from 2j5yA_ca (60 atoms)>,
<Selection: 'resindex 1 to 61' from 2j5yB_ca (60 atoms)>,
<Selection: 'resindex 1 to 55' from 2vdbB_ca (54 atoms)>,
<Selection: 'resindex 1 to 53' from 1tf0B_ca (52 atoms)>,
<Selection: 'resindex 1 to 53' from 1gabA_ca (52 atoms; active #0 of 20 coordsets)>,
<Selection: 'resindex 1 to 53' from 1prbA_ca (52 atoms)>,
<Selection: 'resindex 1 to 52' from 2n35A_ca (51 atoms; active #0 of 20 coordsets)>,
<Selection: 'resindex 1 to 56' from 2jwsA_ca (55 atoms; active #0 of 20 coordsets)>,
<Selection: 'resindex 1 to 56' from 2kdlA_ca (55 atoms; active #0 of 20 coordsets)>,
<Selection: 'resindex 1 to 56' from 2lhcA_ca (55 atoms; active #0 of 20 coordsets)>,
<Selection: 'resindex 1 to 56' from 2lhgA_ca (55 atoms; active #0 of 10 coordsets)>,
<Selection: 'resindex 1 to 56' from 2mh8A_ca (55 atoms; active #0 of 10 coordsets)>,
<Selection: 'resindex 1 to 56' from 2fs1A_ca (55 atoms; active #0 of 20 coordsets)>,
<Selection: 'resindex 1 to 65' from 1gjsA_ca (64 atoms; active #0 of 30 coordsets)>,
<Selection: 'resindex 1 to 65' from 1gjtA_ca (64 atoms)>]
This then allows us to build a PDBEnsemble from them:
In [29]: ens = buildPDBEnsemble(proteins, mapping='CE')
In [30]: ens
Out[30]: <PDBEnsemble: Unknown (15 conformations; 60 atoms)>