Sequence Analysis

Evol component of ProDy package brought new fast, flexible, and efficient features for handling multiple sequence alignments and analyzing sequence evolution. Here, we just give a brief introduction to these features. For more detailed examples, see Evol Tutorial.

In [1]: from prody import *

In [2]: from pylab import *

In [3]: ion()

Access Pfam

First, let’s fetch an MSA file from Pfam database:

In [4]: filename = fetchPfamMSA('pkinase', alignment='seed')

In [5]: filename

We downloaded the seed alignment for Protein Kinase (Pkinase) family.

Parse MSA

As you might guess, we will parse this file using parseMSA() function:

In [6]: msa = parseMSA('pkinase_seed.sth')

In [7]: msa
Out[7]: <MSA: pkinase_seed (38 sequences, 419 residues)>

Sequences

You can access Sequence objects by indexing MSA:

In [8]: seq = msa[0]

In [9]: seq
Out[9]: <Sequence: TTK_HUMAN (pkinase_seed[0]; length 419; 267 residues and 152 gaps)>

In [10]: print(seq)
YSILKQIGSGGSSKV..FQVLN.EKKQIYAIKYVNLEEADNQTL......DSYRNEIAYLNKLQQ.HSDKIIRLYDYEIT.............DQYIY..MVMECGNID.....LNSWLK.......KKKS.IDPW.......ERK.SYWKNMLEAVHTIH.QHG...IVHSDLKPANFLIV............DGM........LKLIDFGIANQMQPDTTS................VVKDSQVGTVNYMPPEAIKDMSSSRENGKS...KSKISPKSDVWSLGCILYYMTYGKTPFQQIINQISKLHAIID.................................PNHEI.........................EFPDIPEKDLQDVLKCCLKRDPKQRIS.....IPELLAHPYV

You can also slice MSA objects and iterate over sequences:

In [11]: for seq in msa[:4]:
   ....:    print(repr(seq))
   ....: 
<Sequence: TTK_HUMAN (pkinase_seed[0]; length 419; 267 residues and 152 gaps)>
<Sequence: MKK1_YEAST (pkinase_seed[1]; length 419; 268 residues and 151 gaps)>
<Sequence: STE7_YEAST (pkinase_seed[2]; length 419; 276 residues and 143 gaps)>
<Sequence: BYR1_SCHPO (pkinase_seed[3]; length 419; 255 residues and 164 gaps)>

Analysis

Evol component includes several functions for calculating conservation and coevolution properties of amino acids, which are shown in Evol Tutorial. Here, let’s take a look at calcMSAOccupancy() and showMSAOccupancy() functions:

In [12]: occ = calcMSAOccupancy(msa, count=True)

In [13]: occ.min()
Out[13]: 1.0

This shows that an amino acid is present only in one of the sequences in the MSA.

In [14]: showMSAOccupancy(msa, count=True);
../../_images/prody_tutorial_sequence_occ.png

You see that many residues are not present in all sequences. You will see how to refine such MSA instances in Evol Tutorial.