Sequence Analysis

Evol component of ProDy package brought new fast, flexible, and efficient features for handling multiple sequence alignments and analyzing sequence evolution.

Accessing Pfam

First, let’s fetch an MSA file from Pfam database:

In [3]:
filename = fetchPfamMSA('pkinase', alignment='seed')
@> Pfam MSA for pkinase is written as pkinase_seed.sth.
In [5]:
filename
Out[5]:
'pkinase_seed.sth'

Parse MSA

In [6]:
msa = parseMSA(filename)
@> 38 sequence(s) with 419 residues were parsed in 0.02s.
In [7]:
msa
Out[7]:
<MSA: pkinase_seed (38 sequences, 419 residues)>

Sequences

You can access Sequence objects by indexing MSA:

In [8]:
seq = msa[0]
In [9]:
seq
Out[9]:
<Sequence: CDC15_YEAST (pkinase_seed[0]; length 419; 248 residues and 171 gaps)>
In [10]:
print seq
YHLKQVIGRGSYGVV..YKAINKHTDQVVAIKEVVYENDEELND........IMAEISLLKNLN...HNNIVKYHGFIRK.............SYELY..ILLEYCANGS....LRRLIS.......RSSTGLSEN.......ESK.TYVTQTLLGLKYLH.GEG...VIHRDIKAANILLS............ADN.......TVKLADFGVSTIVNSS.....................ALTLAGTLNWMAPEIL..........GN...RGA.STLSDIWSLGATVVEMLTKNPPYHNLTDANIYYAVEND.........................TYYPPSSF....................................SEPLKDFLSKCFVKNMYKRPT.....ADQLLKHVWI

Analysis

Evol component includes several functions for calculating conservation and coevolution properties of amino acids

In [11]:
occ = calcMSAOccupancy(msa, count=True)
occ.min()
Out[11]:
1.0

This shows that an amino acid is present only in one of the sequences in the MSA.

In [12]:
showMSAOccupancy(msa, count=True);

Sequence-Structure Comparison with Evol

The part shows how to compare sequence conservation properties with structural mobility obtained from Gaussian network model (GNM) calculations.

Setting up environment

In [1]:
from prody import *
from pylab import *
%matplotlib inline
//anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

Entropy Calculation

First, we retrieve MSA for protein for protein family PF00074:

In [2]:
fetchPfamMSA('PF00074')
@> Pfam MSA for PF00074 is written as PF00074_full.sth.
Out[2]:
'PF00074_full.sth'

We parse the MSA file:

In [3]:
msa = parseMSA('PF00074_full.sth')
@> 645 sequence(s) with 248 residues were parsed in 0.01s.

Then, we refine it using refineMSA() based on the sequence of RNAS1_BOVIN:

In [4]:
msa_refine = refineMSA(msa, label='RNAS1_BOVIN', rowocc=0.8, seqid=0.98)
@> Label refinement reduced number of columns from 248 to 119 in 0.00s.
@> Row occupancy refinement reduced number of rows from 645 to 583 in 0.01s.
@> Sequence identity refinement reduced number of rows from 583 to 485 in 0.07s.

We calculate entropy for refined MSA.

In [5]:
entropy = calcShannonEntropy(msa_refine)
In [6]:
showShannonEntropy(msa_refine)
Out[6]:
<Container object of 119 artists>

Mobility calculation

Next, we obtain residue fluctuations or mobility for protein member of the above family. We will use chain B of 2W5I.

In [7]:
pdb = parsePDB('2W5I', chain='B')

chB_ca = pdb.select('protein and name CA and resid 1 to 119')
@> Connecting wwPDB FTP server RCSB PDB (USA).
@> 2w5i downloaded (2w5i.pdb.gz)
@> PDB download via FTP completed (1 downloaded, 0 failed).
@> 973 atoms and 1 coordinate set(s) were parsed in 0.03s.

We perform GNM as follows:

In [8]:
gnm = GNM('2W5I')

gnm.buildKirchhoff(chB_ca)

gnm.calcModes(n_modes=None)  # calculate all modes
@> Kirchhoff was built in 0.01s.
@> 118 modes were calculated in 0.05s.
In [9]:
mobility_1 = calcSqFlucts(gnm[0])

mobility_1to8 = calcSqFlucts(gnm[:8])

mobility_all = calcSqFlucts(gnm[:])
In [10]:
showSqFlucts(gnm[0])
Out[10]:
[<matplotlib.lines.Line2D at 0x11381d550>]
In [12]:
showSqFlucts(gnm[:], hinge=False)
Out[12]:
[<matplotlib.lines.Line2D at 0x113cb87d0>]

Plotting

In [24]:
indices = range(1,120)

bar(indices, entropy, width=1.2, color='grey', hold='True');

xlim(min(indices)-1, max(indices)+1);

plot(indices, mobility_all*(max(entropy)/mean(mobility_all)), color='b',
linewidth=2);

xlabel('residue index')
ylabel('mobility/entropy')
Out[24]:
<matplotlib.text.Text at 0x114efcf10>

Coevolution Calculation

In [25]:
mutinfo = buildMutinfoMatrix(msa_refine)
@> Mutual information matrix was calculated in 0.04s.
In [26]:
showMutinfoMatrix(msa_refine)
@> Mutual information matrix was calculated in 0.03s.
Out[26]:
(<matplotlib.image.AxesImage at 0x11579b410>,
 <matplotlib.colorbar.Colorbar at 0x11592a190>)
In [27]:
coevol = buildDirectInfoMatrix(msa_refine)
@> DI matrix was calculated in 1.46s.
In [28]:
showDirectInfoMatrix(msa_refine)
@> DI matrix was calculated in 1.35s.
Out[28]:
(<matplotlib.image.AxesImage at 0x110844e50>,
 <matplotlib.colorbar.Colorbar at 0x110f22190>)
In [29]:
showContactMap(gnm)
Out[29]:
<matplotlib.image.AxesImage at 0x10453d510>
In [30]:
showCrossCorr(gnm)
Out[30]:
(<matplotlib.image.AxesImage at 0x111845790>,
 <matplotlib.colorbar.Colorbar at 0x1118d44d0>)