This part shows how to parse, refine, filter, slice, and write MSA files.
In : from prody import * In : from matplotlib.pylab import * In : ion() # turn interactive mode on
Let’s get Pfam MSA file for protein family that contains PIWI_ARCFU:
In : searchPfam('PIWI_ARCFU').keys() Out: ['PF02171']
In : fetchPfamMSA('PF02171', alignment='seed') Out: 'PF02171_seed.sth'
Parsing MSA files¶
In : msafile = 'PF02171_seed.sth' In : msafobj = MSAFile(msafile) In : msafobj Out: <MSAFile: PF02171_seed (Stockholm; mode 'rt')> In : msa_seq_list = list(msafobj) In : msa_seq_list Out: <Sequence: TAG76_CAEEL (length 395; 307 residues and 88 gaps)>
In : fetchPfamMSA('PF02171', compressed=True) Out: 'PF02171_full.sth.gz' In : parseMSA('PF02171_full.sth.gz') Out: <MSA: PF02171_full (2067 sequences, 1392 residues)> In : fetchPfamMSA('PF02171', format='fasta') Out: 'PF02171_full.fasta.gz' In : parseMSA('PF02171_full.fasta.gz') Out: <MSA: PF02171_full (2067 sequences, 1392 residues)>
Iterating over a file will yield sequence id, sequence, residue start and end indices:
In : msa = MSAFile('PF02171_seed.sth') In : for seq in msa: ....: seq ....:
Filtering and Slicing¶
Any function that takes label and sequence arguments and returns a boolean value can be used for filtering the sequences. A sequence will be yielded if the function returns True. In the following example, sequences from organism ARATH are filtered:
In : msafobj = MSAFile(msafile, filter=lambda lbl, seq: 'ARATH' in lbl) In : for seq in msafobj: ....: seq.getLabel() ....:
A list of integers can be used to slice sequences as follows. This enables selective parsing of the MSA file.
In : msafobj = MSAFile(msafile, slice=list(range(10)) + list(range(374,384))) In : list(msafobj) Out: <Sequence: TAG76_CAEEL (length 20; 19 residues and 1 gaps)>
Retrieving a sequence at a given index, or by id will give an object of
In : msa = parseMSA(msafile) In : seq = msa In : seq Out: <Sequence: TAG76_CAEEL (PF02171_seed; length 395; 307 residues and 88 gaps)> In : str(seq) Out: 'CIIVVLQS.KNSDI.YMTVKEQSDIVHGIMSQCVLMKNVSRP.........TPATCANIVLKLNMKMGGIN..SRIVADKITNKYLVDQPTM.........VVGIDVTHPTQAEM.......RMNMPSVAAIVANVD.LLPQSYGANVKVQKKCRESVVY.LLD............AIRERIITFYRHTKQ.KPARIIVYRDGVSEGQFSEVLREEIQSIRTACL.......AIAEDFR..PPITYIVVQKRHHARIFCKYQNDM.....................VGKAKNVPPGTTV...DTGIVSPEGFDFYLCSHYGVQGTSRPARYHVLLDECKFTADEI.QSI......TYGMCHTYGRCT....RSVSIPTPVYYADLVATRARCHVK'
Retrieve a sequence by UniProt ID:
In : msa['YQ53_CAEEL'] Out: <Sequence: YQ53_CAEEL (PF02171_seed; length 395; 328 residues and 67 gaps)>
You can query whether a sequence in contained in the instance using the UniProt identifier of the sequence as follows:
In : 'YQ53_CAEEL' in msa Out: True
Slice an MSA instance to give a new
MSA. object :
In : new_msa = msa[:2] In : new_msa Out: <MSA: PF02171_seed' (2 sequences, 395 residues)>
Slice using a list of UniProt IDs:
In : msa[:2] == msa[['TAG76_CAEEL', 'O16720_CAEEL']] Out: True
Retrieve a character or a slice of a sequence:
In : msa[0,0] Out: <Sequence: TAG76_CAEEL (length 1; 1 residues and 0 gaps)> In : msa[0,0:10] Out: <Sequence: TAG76_CAEEL (length 10; 9 residues and 1 gaps)>
Slice MSA rows and columns:
In : msa[:10,20:40] Out: <MSA: PF02171_seed' (10 sequences, 20 residues)>
mergeMSA() can be used to merge two or more MSAs. Based on their labels
only those sequences that appear in both MSAs are retained, and concatenated
horizontally to give a joint or merged MSA. This can be useful while evaluating
covariance patterns for proteins with multiple domains or protein-protein
interactions. The example shows merging for the multi-domain receptor
3KG2 containing pfam domains PF01094 and PF00497.
In : fetchPfamMSA('PF00017', alignment='seed') Out: 'PF00017_full.slx.gz'
In : fetchPfamMSA('PF07714', alignment='seed') Out: 'PF07714_full.slx.gz'
Let’s parse and merge the two files:
In : msa1 = parseMSA('PF00017_seed.sth') In : msa1 Out: <MSA: PF00017_seed (58 sequences, 109 residues)> In : msa2 = parseMSA('PF07714_seed.sth') In : msa2 Out: <MSA: PF07714_seed (145 sequences, 484 residues)> In : merged = mergeMSA(msa1, msa2) In : merged Out: <MSA: PF00017_seed + PF07714_seed (14 sequences, 593 residues)>
Merged MSA contains 14 sequences.
writeMSA() can be used to write MSA. It takes filename as input
which should contain appropriate extension that can be
".fasta" or format should be specified as
"FASTA". Input MSA should be
MSA object. Filename can contain
".gz" extension, in which case
a compressed file will be written.
In : writeMSA('sliced_MSA.gz', msa, format='SELEX') Out: 'sliced_MSA.gz' In : writeMSA('sliced_MSA.fasta', msafobj) Out: 'sliced_MSA.fasta'
writeMSA() returns the name of the MSA file that is written.