MSA Files

This part shows how to parse, refine, filter, slice, and write MSA files.

In [1]: from prody import *

In [2]: from matplotlib.pylab import *

In [3]: ion()  # turn interactive mode on

Let’s get Pfam MSA file for protein family that contains PIWI_ARCFU:

In [4]: searchPfam('PIWI_ARCFU').keys()
Out[4]: ['PF02171']
In [5]: fetchPfamMSA('PF02171', alignment='seed')
Out[5]: 'PF02171_seed.sth'

Parsing MSA files

This shows how to use the MSAFile or parseMSA() to read the MSA file.

Reading using MSAFile yields an MSAFile object. Iterating over the object will yield an object of Sequence from which labels, sequence can be obtained.

In [6]: msafile = 'PF02171_seed.sth'

In [7]: msafobj = MSAFile(msafile)

In [8]: msafobj
Out[8]: <MSAFile: PF02171_seed (Stockholm; mode 'rt')>

In [9]: msa_seq_list = list(msafobj)

In [10]: msa_seq_list[0]
Out[10]: <Sequence: TAG76_CAEEL (length 395; 307 residues and 88 gaps)>

parseMSA() returns an MSA object. We can parse compressed files, but reading uncompressed files are much faster.

In [11]: fetchPfamMSA('PF02171', compressed=True)
Out[11]: 'PF02171_full.sth.gz'

In [12]: parseMSA('PF02171_full.sth.gz')
Out[12]: <MSA: PF02171_full (2067 sequences, 1392 residues)>

In [13]: fetchPfamMSA('PF02171', format='fasta')
Out[13]: 'PF02171_full.fasta.gz'

In [14]: parseMSA('PF02171_full.fasta.gz')
Out[14]: <MSA: PF02171_full (2067 sequences, 1392 residues)>

Iterating over a file will yield sequence id, sequence, residue start and end indices:

In [15]: msa = MSAFile('PF02171_seed.sth')

In [16]: for seq in msa:
   ....:     seq
   ....: 

Filtering and Slicing

This shows how to use the MSAFile object or MSA object to refine MSA using filters and slices.

Filtering

Any function that takes label and sequence arguments and returns a boolean value can be used for filtering the sequences. A sequence will be yielded if the function returns True. In the following example, sequences from organism ARATH are filtered:

In [17]: msafobj = MSAFile(msafile, filter=lambda lbl, seq: 'ARATH' in lbl)

In [18]: for seq in msafobj:
   ....:     seq.getLabel()
   ....: 

Slicing

A list of integers can be used to slice sequences as follows. This enables selective parsing of the MSA file.

In [19]: msafobj = MSAFile(msafile, slice=list(range(10)) + list(range(374,384)))

In [20]: list(msafobj)[0]
Out[20]: <Sequence: TAG76_CAEEL (length 20; 19 residues and 1 gaps)>

Slicing can also be done using MSA. The MSA object offers other functionalities like querying, indexing, slicing row and columns and refinement.

MSA objects

Indexing

Retrieving a sequence at a given index, or by id will give an object of Sequence:

In [21]: msa = parseMSA(msafile)

In [22]: seq = msa[0]

In [23]: seq
Out[23]: <Sequence: TAG76_CAEEL (PF02171_seed[0]; length 395; 307 residues and 88 gaps)>

In [24]: str(seq)
Out[24]: 'CIIVVLQS.KNSDI.YMTVKEQSDIVHGIMSQCVLMKNVSRP.........TPATCANIVLKLNMKMGGIN..SRIVADKITNKYLVDQPTM.........VVGIDVTHPTQAEM.......RMNMPSVAAIVANVD.LLPQSYGANVKVQKKCRESVVY.LLD............AIRERIITFYRHTKQ.KPARIIVYRDGVSEGQFSEVLREEIQSIRTACL.......AIAEDFR..PPITYIVVQKRHHARIFCKYQNDM.....................VGKAKNVPPGTTV...DTGIVSPEGFDFYLCSHYGVQGTSRPARYHVLLDECKFTADEI.QSI......TYGMCHTYGRCT....RSVSIPTPVYYADLVATRARCHVK'

Retrieve a sequence by UniProt ID:

In [25]: msa['YQ53_CAEEL']
Out[25]: <Sequence: YQ53_CAEEL (PF02171_seed[6]; length 395; 328 residues and 67 gaps)>

Querying

You can query whether a sequence in contained in the instance using the UniProt identifier of the sequence as follows:

In [26]: 'YQ53_CAEEL' in msa
Out[26]: True

Slicing

Slice an MSA instance to give a new MSA. object :

In [27]: new_msa = msa[:2]

In [28]: new_msa
Out[28]: <MSA: PF02171_seed' (2 sequences, 395 residues)>

Slice using a list of UniProt IDs:

In [29]: msa[:2] == msa[['TAG76_CAEEL', 'O16720_CAEEL']]
Out[29]: True

Retrieve a character or a slice of a sequence:

In [30]: msa[0,0]
Out[30]: <Sequence: TAG76_CAEEL (length 1; 1 residues and 0 gaps)>

In [31]: msa[0,0:10]
Out[31]: <Sequence: TAG76_CAEEL (length 10; 9 residues and 1 gaps)>

Slice MSA rows and columns:

In [32]: msa[:10,20:40]
Out[32]: <MSA: PF02171_seed' (10 sequences, 20 residues)>

Merging MSAs

mergeMSA() can be used to merge two or more MSAs. Based on their labels only those sequences that appear in both MSAs are retained, and concatenated horizontally to give a joint or merged MSA. This can be useful while evaluating covariance patterns for proteins with multiple domains or protein-protein interactions. The example shows merging for the multi-domain receptor 3KG2 containing pfam domains PF01094 and PF00497.

In [33]: fetchPfamMSA('PF00017', alignment='seed')
Out[33]: 'PF00017_full.slx.gz'
In [34]: fetchPfamMSA('PF07714', alignment='seed')
Out[34]: 'PF07714_full.slx.gz'

Let’s parse and merge the two files:

In [35]: msa1 = parseMSA('PF00017_seed.sth')

In [36]: msa1
Out[36]: <MSA: PF00017_seed (58 sequences, 109 residues)>

In [37]: msa2 = parseMSA('PF07714_seed.sth')

In [38]: msa2
Out[38]: <MSA: PF07714_seed (145 sequences, 484 residues)>

In [39]: merged = mergeMSA(msa1, msa2)

In [40]: merged
Out[40]: <MSA: PF00017_seed + PF07714_seed (14 sequences, 593 residues)>

Merged MSA contains 14 sequences.

Writing MSAs

writeMSA() can be used to write MSA. It takes filename as input which should contain appropriate extension that can be ".slx" or ".sth" or ".fasta" or format should be specified as "SELEX", "Stockholm" or "FASTA". Input MSA should be MSAFile or MSA object. Filename can contain ".gz" extension, in which case a compressed file will be written.

In [41]: writeMSA('sliced_MSA.gz', msa, format='SELEX')
Out[41]: 'sliced_MSA.gz'

In [42]: writeMSA('sliced_MSA.fasta', msafobj)
Out[42]: 'sliced_MSA.fasta'

writeMSA() returns the name of the MSA file that is written.