Evol component of ProDy package brought new fast, flexible, and efficient features for handling multiple sequence alignments and analyzing sequence evolution.
First, let’s fetch an MSA file from Pfam database:
filename = fetchPfamMSA('pkinase', alignment='seed')
@> Pfam MSA for pkinase is written as pkinase_seed.sth.
msa = parseMSA(filename)
@> 38 sequence(s) with 419 residues were parsed in 0.02s.
<MSA: pkinase_seed (38 sequences, 419 residues)>
You can access Sequence objects by indexing MSA:
seq = msa
<Sequence: CDC15_YEAST (pkinase_seed; length 419; 248 residues and 171 gaps)>
Evol component includes several functions for calculating conservation and coevolution properties of amino acids
occ = calcMSAOccupancy(msa, count=True) occ.min()
This shows that an amino acid is present only in one of the sequences in the MSA.
The part shows how to compare sequence conservation properties with structural mobility obtained from Gaussian network model (GNM) calculations.
from prody import * from pylab import * %matplotlib inline
//anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment. warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
First, we retrieve MSA for protein for protein family PF00074:
@> Pfam MSA for PF00074 is written as PF00074_full.sth.
We parse the MSA file:
msa = parseMSA('PF00074_full.sth')
@> 645 sequence(s) with 248 residues were parsed in 0.01s.
Then, we refine it using refineMSA() based on the sequence of RNAS1_BOVIN:
msa_refine = refineMSA(msa, label='RNAS1_BOVIN', rowocc=0.8, seqid=0.98)
@> Label refinement reduced number of columns from 248 to 119 in 0.00s. @> Row occupancy refinement reduced number of rows from 645 to 583 in 0.01s. @> Sequence identity refinement reduced number of rows from 583 to 485 in 0.07s.
We calculate entropy for refined MSA.
entropy = calcShannonEntropy(msa_refine)
<Container object of 119 artists>
Next, we obtain residue fluctuations or mobility for protein member of the above family. We will use chain B of 2W5I.
pdb = parsePDB('2W5I', chain='B') chB_ca = pdb.select('protein and name CA and resid 1 to 119')
@> Connecting wwPDB FTP server RCSB PDB (USA). @> 2w5i downloaded (2w5i.pdb.gz) @> PDB download via FTP completed (1 downloaded, 0 failed). @> 973 atoms and 1 coordinate set(s) were parsed in 0.03s.
We perform GNM as follows:
gnm = GNM('2W5I') gnm.buildKirchhoff(chB_ca) gnm.calcModes(n_modes=None) # calculate all modes
@> Kirchhoff was built in 0.01s. @> 118 modes were calculated in 0.05s.
mobility_1 = calcSqFlucts(gnm) mobility_1to8 = calcSqFlucts(gnm[:8]) mobility_all = calcSqFlucts(gnm[:])
[<matplotlib.lines.Line2D at 0x11381d550>]
[<matplotlib.lines.Line2D at 0x113cb87d0>]
indices = range(1,120) bar(indices, entropy, width=1.2, color='grey', hold='True'); xlim(min(indices)-1, max(indices)+1); plot(indices, mobility_all*(max(entropy)/mean(mobility_all)), color='b', linewidth=2); xlabel('residue index') ylabel('mobility/entropy')
<matplotlib.text.Text at 0x114efcf10>
mutinfo = buildMutinfoMatrix(msa_refine)
@> Mutual information matrix was calculated in 0.04s.
@> Mutual information matrix was calculated in 0.03s.
(<matplotlib.image.AxesImage at 0x11579b410>, <matplotlib.colorbar.Colorbar at 0x11592a190>)
coevol = buildDirectInfoMatrix(msa_refine)
@> DI matrix was calculated in 1.46s.
@> DI matrix was calculated in 1.35s.
(<matplotlib.image.AxesImage at 0x110844e50>, <matplotlib.colorbar.Colorbar at 0x110f22190>)
<matplotlib.image.AxesImage at 0x10453d510>
(<matplotlib.image.AxesImage at 0x111845790>, <matplotlib.colorbar.Colorbar at 0x1118d44d0>)