# Sequence Analysis¶

Evol component of ProDy package brought new fast, flexible, and efficient features for handling multiple sequence alignments and analyzing sequence evolution.

### Accessing Pfam¶

First, let’s fetch an MSA file from Pfam database:

In [3]:
filename = fetchPfamMSA('pkinase', alignment='seed')

@> Pfam MSA for pkinase is written as pkinase_seed.sth.

In [5]:
filename

Out[5]:
'pkinase_seed.sth'

### Parse MSA¶

In [6]:
msa = parseMSA(filename)

@> 38 sequence(s) with 419 residues were parsed in 0.02s.

In [7]:
msa

Out[7]:
<MSA: pkinase_seed (38 sequences, 419 residues)>

### Sequences¶

You can access Sequence objects by indexing MSA:

In [8]:
seq = msa[0]

In [9]:
seq

Out[9]:
<Sequence: CDC15_YEAST (pkinase_seed[0]; length 419; 248 residues and 171 gaps)>
In [10]:
print seq

YHLKQVIGRGSYGVV..YKAINKHTDQVVAIKEVVYENDEELND........IMAEISLLKNLN...HNNIVKYHGFIRK.............SYELY..ILLEYCANGS....LRRLIS.......RSSTGLSEN.......ESK.TYVTQTLLGLKYLH.GEG...VIHRDIKAANILLS............ADN.......TVKLADFGVSTIVNSS.....................ALTLAGTLNWMAPEIL..........GN...RGA.STLSDIWSLGATVVEMLTKNPPYHNLTDANIYYAVEND.........................TYYPPSSF....................................SEPLKDFLSKCFVKNMYKRPT.....ADQLLKHVWI


### Analysis¶

Evol component includes several functions for calculating conservation and coevolution properties of amino acids

In [11]:
occ = calcMSAOccupancy(msa, count=True)
occ.min()

Out[11]:
1.0

This shows that an amino acid is present only in one of the sequences in the MSA.

In [12]:
showMSAOccupancy(msa, count=True);


# Sequence-Structure Comparison with Evol¶

The part shows how to compare sequence conservation properties with structural mobility obtained from Gaussian network model (GNM) calculations.

## Setting up environment¶

In [1]:
from prody import *
from pylab import *
%matplotlib inline

//anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')


## Entropy Calculation¶

First, we retrieve MSA for protein for protein family PF00074:

In [2]:
fetchPfamMSA('PF00074')

@> Pfam MSA for PF00074 is written as PF00074_full.sth.

Out[2]:
'PF00074_full.sth'

We parse the MSA file:

In [3]:
msa = parseMSA('PF00074_full.sth')

@> 645 sequence(s) with 248 residues were parsed in 0.01s.


Then, we refine it using refineMSA() based on the sequence of RNAS1_BOVIN:

In [4]:
msa_refine = refineMSA(msa, label='RNAS1_BOVIN', rowocc=0.8, seqid=0.98)

@> Label refinement reduced number of columns from 248 to 119 in 0.00s.
@> Row occupancy refinement reduced number of rows from 645 to 583 in 0.01s.
@> Sequence identity refinement reduced number of rows from 583 to 485 in 0.07s.


We calculate entropy for refined MSA.

In [5]:
entropy = calcShannonEntropy(msa_refine)

In [6]:
showShannonEntropy(msa_refine)

Out[6]:
<Container object of 119 artists>

## Mobility calculation¶

Next, we obtain residue fluctuations or mobility for protein member of the above family. We will use chain B of 2W5I.

In [7]:
pdb = parsePDB('2W5I', chain='B')

chB_ca = pdb.select('protein and name CA and resid 1 to 119')

@> Connecting wwPDB FTP server RCSB PDB (USA).
@> 973 atoms and 1 coordinate set(s) were parsed in 0.03s.


We perform GNM as follows:

In [8]:
gnm = GNM('2W5I')

gnm.buildKirchhoff(chB_ca)

gnm.calcModes(n_modes=None)  # calculate all modes

@> Kirchhoff was built in 0.01s.
@> 118 modes were calculated in 0.05s.

In [9]:
mobility_1 = calcSqFlucts(gnm[0])

mobility_1to8 = calcSqFlucts(gnm[:8])

mobility_all = calcSqFlucts(gnm[:])

In [10]:
showSqFlucts(gnm[0])

Out[10]:
[<matplotlib.lines.Line2D at 0x11381d550>]
In [12]:
showSqFlucts(gnm[:], hinge=False)

Out[12]:
[<matplotlib.lines.Line2D at 0x113cb87d0>]

## Plotting¶

In [24]:
indices = range(1,120)

bar(indices, entropy, width=1.2, color='grey', hold='True');

xlim(min(indices)-1, max(indices)+1);

plot(indices, mobility_all*(max(entropy)/mean(mobility_all)), color='b',
linewidth=2);

xlabel('residue index')
ylabel('mobility/entropy')

Out[24]:
<matplotlib.text.Text at 0x114efcf10>

## Coevolution Calculation¶

In [25]:
mutinfo = buildMutinfoMatrix(msa_refine)

@> Mutual information matrix was calculated in 0.04s.

In [26]:
showMutinfoMatrix(msa_refine)

@> Mutual information matrix was calculated in 0.03s.

Out[26]:
(<matplotlib.image.AxesImage at 0x11579b410>,
<matplotlib.colorbar.Colorbar at 0x11592a190>)
In [27]:
coevol = buildDirectInfoMatrix(msa_refine)

@> DI matrix was calculated in 1.46s.

In [28]:
showDirectInfoMatrix(msa_refine)

@> DI matrix was calculated in 1.35s.

Out[28]:
(<matplotlib.image.AxesImage at 0x110844e50>,
<matplotlib.colorbar.Colorbar at 0x110f22190>)
In [29]:
showContactMap(gnm)

Out[29]:
<matplotlib.image.AxesImage at 0x10453d510>
In [30]:
showCrossCorr(gnm)

Out[30]:
(<matplotlib.image.AxesImage at 0x111845790>,
<matplotlib.colorbar.Colorbar at 0x1118d44d0>)