Upload
olivier-bastien
View
213
Download
1
Embed Size (px)
Citation preview
Future Generation Computer Systems 23 (2007) 410–427www.elsevier.com/locate/fgcs
The configuration space of homologous proteins: A theoretical and practicalframework to reduce the diversity of the protein sequence space after
massive all-by-all sequence comparisons
Olivier Bastiena,b, Philippe Ortetc, Sylvaine Royd, Eric Marechala,∗
a UMR 5168 CNRS-CEA-INRA-Universite Joseph Fourier, Laboratoire de Physiologie Cellulaire Vegetale, Departement Reponse et Dynamique Cellulaires,CEA Grenoble, 17 rue des Martyrs, F-38054 Grenoble cedex 09, Franceb Gene-IT, 147 avenue Paul Doumer, F-92500 Rueil-Malmaison, France
c Departement d’Ecophysiologie Vegetale et de Microbiologie, CEA Cadarache, F-13108 Saint Paul-lez-Durance, Franced Laboratoire Biologie, Informatique, Mathematiques, Departement Reponse et Dynamique Cellulaires, CEA Grenoble, 17 rue des Martyrs, F-38054
Grenoble cedex 09, France
Received 9 December 2005; received in revised form 30 March 2006; accepted 2 July 2006Available online 30 August 2006
Abstract
Most of the millions of virtual protein sequences deduced from genomic DNA, and the millions to come, will not be experimentally confirmed,neither their function directly analyzed. The exploration of the majority of the protein space relies on our ability to extrapolate the portion ofknowledge on characterized sequences to unknown sequences. In this paper we analyzed the large scale comparisons of hundreds of thousandsof protein sequences that have been previously carried out using the power of supercomputers or grid frameworks. Following these comparisons,pragmatic rules were used to reduce protein diversity, but none was based on a rigorous and robust framework. We examined how projectionof sequences in the configuration space of homologous proteins (CSHP) could help in providing a theoretically robust and long-term practicalsolution to help organize the protein space. The CSHP can be constructed from the output of any all-by-all pair-wise comparison in whichZ-values were computed after Monte Carlo simulations. Reduction of protein diversity can be carried out according to an evolutionary modelraising consistent phylogenetic clusters. Projection in the CSHP can be easily updated after sequence database updates, and the accuracy of thephylogenetic topology can be upgraded by improving sub-models. Clusters of homologous proteins can be represented as phylogenetic trees(TULIP trees). In this paper, we showed that the CSHP projection can be used to process the outputs of previous massive comparison projectsbased on Z-value statistics, given minor corrections for uncollected low values and we propose guidelines for future generations of massive proteinsequence comparison projects.c© 2006 Elsevier B.V. All rights reserved.
Keywords: Protein space; Protein sequence comparison; TULIP; Configuration space of homologous protein; CluSTr; Z-value
1. Introduction
In 2005, more than 3 million protein sequences obtainedfrom numerical translation of DNA coding sequences (thereading frames of genes) were electronically stored in public
∗ Corresponding address: UMR 5019 CNRS-CEA-INRA-Universite JosephFourier, Laboratoire de Physiologie Cellulaire Vegetale, Departement Reponseet Dynamique Cellulaire, CEA Grenoble, 17 rue des Martyrs, F-38054Grenoble cedex 09, France. Tel.: +33 04 38 78 49 85; fax: +33 04 38 78 5091.
E-mail address: [email protected] (E. Marechal).
0167-739X/$ - see front matter c© 2006 Elsevier B.V. All rights reserved.doi:10.1016/j.future.2006.07.016
databases. The production rate of virtual proteins is exponen-tial, fueled by international funding, laboratory networking andcompetition. Additional productivity gain is expected soon, fol-lowing a recent technological breakthrough [1], i.e. a DNA se-quencing method allowing a 100 fold increase in throughput,reading for instance 25 million bases of genetic code – the en-tire genome of some fungi – within hours. Publicity followingthe release of complete genomes raised higher expectationsthan what could be reasonably achieved in the short term. Thisis particularly true for the human species, with a hope of accel-erating the identification of genes involved in genetic diseases
O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 411
or cancer; for pathogens, with a hope of identifying new fight-ing targets; and for crops, with a hope of improving productivityand quality. Meanwhile, chemical technologies made progressin the throughput of small molecule syntheses. Millions of com-pounds have been physically stored in chemolibraries; theirmolecular structures were electronically stored. Having in hand3D structures of protein active sites and of chemolibraries,methods are available to predict interactions in virtual dock-ing experiments (e.g. [2]). PubChem, a repository for moleculesacting on biological targets, was recently launched [3] and theUniProt protein knowledge base was recently upgraded to re-port toxic doses of small molecules on proteins [4]. Access to anocean of small molecular structures and to a deluge of biolog-ical sequences raised therefore an enthusiastic challenge: “Thegoal for the coming decades will be to explore the overlap be-tween chemistry space and protein space” [5]. No doubt gigan-tic computing power will be demanded to reach this chemoge-nomic horizon. Is this post-genomic prediction exuberant? Be-sides upstream quality of data and downstream methods bywhich interactions might be screened, can we at least definewhat we mean by protein and small molecule spaces? This pa-per focuses on our recent advances toward the definition of astable reference for a protein space and navigation map at thesequence level.
In recent years, large scale comparisons of hundreds ofthousands of protein sequences have been released usingthe calculation power of supercomputers or grid frameworks(Tables 1 and 2). Putting some order into thousands of proteinsfollowing these massive comparisons, using representationsbased on “proximity” criteria, have provided pragmatic rulesto reduce the initial diversity. For a long term use, updatingpipelines should prevent initial results from being lost andshould be easily carried out. A wealth of bioinformaticmethods exist to compare and sort proteins at the level offull-length sequences or of sequence sub-domains and motifs.Combining methods is wise to circumvent the limitations ofeach technique; however no combination is considered asa standard. Thus, although some attempts to organize theprotein space were undergone by pragmatic combinations ofbioinformatic methods, a representation based on such complexworkflows is risky on a long term perspective. Given thenumber of virtual protein sequences (soon exceeding two-digitmillions) and the long-term investment, organizing the proteinspace should start by a simple process, should be biologicallysound, theoretically rather than pragmatically defined so as tobe as explicit as possible, and should be statistically accurate,with the widest validity domain. Because of the CPU cost, suchpre-treatment of the virtual protein sequences should also beeasily incremented with newly released protein sequences.
Comparing protein primary sequences can be carried outfollowing two main methods: the first is based on multiplealignments, usually achieved on sets of sequences presumedto be homologous, the second is based on native pairwisecomparisons. Multiple alignment-based comparisons, raisingphylogenetic protein clusters, are expensive, have to becompletely repeated after each data update and are susceptibleto raise modified phylogeny reconstructions when proteins are
added or removed from the multiple alignment sets. Thus,although protein clusters based on phylogeny is appealing tobiologists, the reduction of protein space diversity based onmultiple alignments demands an exorbitant computing powerand does not allow the conservation of results through updating.For these reasons, multiple alignment-based comparisons donot appear as a practical way to assist the organization of theprotein space.
Few massive pairwise sequence comparisons have beenperformed, some of which being described in Tables 1 and2. Output of an all-by-all comparison of n protein sequencesis an n × n table (Fig. 1). According to the output tableprocessing, it can be either totally recomputed at each databaseupdate, i.e. {new + old}-by-{new + old}, or stored and updatedby computing complementary {new}-by-{new} and {new}-by-{old} tables. Information is extracted from the output table tohelp reduce complexity and diversity at the sequence level.Sets of sequences sharing features are named “clusters” [6]that can be used to extrapolate some functional informationfrom characterized to uncharacterized sequences. In mostcases, results were not exploited by the potential end-users,i.e. bench biologists, at the level one would expect followingsuch expensive experiments. Numerous projects, unlisted here,simply closed the internet portal providing access to theirresults two or three years after release. One reason for thepoor public success of massive comparisons lies in the difficultaccess to protein clusters for the bench biologists and the lackof biologically sound or explicit representation.
In this paper we analyzed the large scale comparisons ofhundreds of thousands of protein sequences that have beenpreviously carried out using the power of supercomputersor grid frameworks and the two general statistical modelsused to assess the alignment score significance. We presentthe configuration space of homologous protein sequences, orCSHP [7], as a projection of the virtual protein sequencespace that is theoretically supported by information theory [8,9]. It is constructed from pair-wise sequence alignment scoresand Z-values obtained after Monte Carlo simulations. In theCSHP, reduction of the protein diversity can be carried outaccording to evolutionary models raising phylogenetic clusters.The accuracy of the Darwinian topology underlying the CSHPprojection depends on the method used to perform sequencecomparisons, the amino acid scoring matrix used to weightthe alignments, the sequence randomization technique and nullmodel used for Monte Carlo simulations and the molecularevolution model used to reconstruct phylogenetic trees. Thespatial representation of proteins in the CSHP can thereforebe easily updated, and the CSHP accuracy and phylogenetictopology be upgraded by improving any of the implementedsub-models. Diversity and complexity of the protein worldis reduced by detecting clusters of phylogenetically relatedproteins, and clusters can be viewed as real phylogenetic trees,named TULIP trees, making sense to a bench biologist. TheCSHP projection aims at providing an underlying order tothe protein space, and to assist the definition of other sub-protein spaces at the structural (sequence similarity only) orfunctional (with links to annotation) levels, as well as subspaces
412 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427
Tabl
e1
Mas
sive
all-
by-a
llpr
otei
nse
quen
cepa
irw
ise
com
pari
sons
base
don
alig
nmen
tE-v
alue
stat
istic
s
Proj
ect
nam
ePr
otei
nse
quen
ceD
BPa
irw
ise
com
pari
son
Div
ersi
tyre
duct
ion
and
clus
teri
ngU
pdat
ing
Ref
eren
cean
din
tern
etac
cess
Num
ber
ofno
n-re
dund
ant
sequ
ence
s(fi
rstr
elea
seda
te)
Sour
ceC
ompu
ting
reso
urce
sA
min
oac
idsc
orin
gm
atri
x
Alig
nmen
tal
gori
thm
Alig
nmen
tsta
tistic
sO
utpu
ttab
leO
rgan
izat
ion
prin
cipl
es(p
ragm
atic
rule
vsth
eore
tical
spat
ial
proj
ectio
n)
Clu
ster
ing
met
hod
and
prod
uced
hier
arch
y
Clu
ster
repr
esen
tatio
nC
lust
erst
atis
tics
Pipe
line
Freq
uenc
y
Mod
elSt
atis
tical
para
met
erC
utof
f
CO
G17
,967
(Oct
ober
1997
)>
170,
000
(Nov
embe
r20
05)
Initi
ally
:en
trie
sfr
om7
com
plet
ege
nom
es.
In20
05:
entr
ies
from
73co
mpl
ete
geno
mes
.
nana
BL
AST
PK
arlin
–A
ltsch
ulm
odel
for
E-v
alue
com
puta
tion
E-v
alue
naB
esth
it(B
eT)
ofea
chen
try,
com
pare
dto
alle
ntri
esbe
long
ing
toa
dist
inct
geno
me
Prag
mat
icor
gani
zatio
nof
data
/no
expl
icit
spat
ial
proj
ectio
n.Pr
agm
atic
rule
:co
nstr
uctio
nof
BeT
grap
hsin
clud
ing
entr
ies
inat
leas
tthr
eege
nom
es
Det
ectio
nof
cons
iste
ntpa
ttern
sin
BeT
grap
hsin
clud
ing
entr
ies
inat
leas
t3ge
nom
es+
insp
ectio
nfo
rpr
otei
ndo
mai
ns.N
ohi
erar
chy
prod
uced
.
Lis
tsof
entr
ies
belo
ngin
gto
clus
ters
(nam
edC
OG
)cl
assi
fied
byoc
curr
ence
/ab
senc
ein
each
geno
me
(spe
cies
profi
ling)
Initi
ally
none
.In
2001
,ate
stw
asin
trod
uced
base
don
anes
timat
eof
the
prob
abili
tyth
atan
entr
yis
assi
gned
toa
CO
Gby
chan
ce
Upd
ate
prot
ein
data
base
geno
me
byge
nom
e;co
mpu
teB
eTw
ithne
wen
trie
sag
ains
ten
trie
sfr
omge
nom
espr
evio
usly
com
pare
d;co
mpl
ete
and
incr
emen
tB
eTgr
aphs
;re
peat
clus
teri
ngfr
omne
wB
eTgr
aphs
;ap
ply
stat
istic
alte
ston
grap
hs
Las
tdat
abas
eup
date
attim
eof
wri
ting:
May
2003
.Upd
ate
freq
uenc
yof
the
CO
Gda
taba
sede
pend
son
the
rele
ase
freq
uenc
yof
relia
ble
com
plet
ege
nom
e.M
ore
freq
uent
upda
tes
ofth
een
try
refe
renc
esar
eun
derg
one.
[ 42–
45]
Tri
be31
1,25
7(M
arch
2003
)
CO
GE
NT
(83
geno
mes
)+
Swis
sPro
t
400
node
Com
paq
Alp
haD
S10
clus
ter
(8h)
naB
LA
STP
Kar
lin–
Alts
chul
mod
elfo
rE
-val
ueco
mpu
tatio
n
E-v
alue
E-v
alue
≤1.
10−
10A
lignm
ent
scor
ean
dE
-val
ues
proc
esse
dto
corr
ect
asym
met
ric
BL
AST
hits
and
scor
es
Prag
mat
icor
gani
zatio
nof
data
/no
expl
icit
spat
ial
proj
ectio
n.C
onve
rsio
nof
E-v
alue
tabl
ein
toM
arko
vM
atri
x
Mar
kov-
rand
om-
field
clus
teri
ngw
ithT
ribe
MC
Lal
gori
thm
.N
ohi
erar
chy
prod
uced
.
Clu
ster
sof
mul
tiple
gran
ular
ities
byal
teri
ngin
flatio
nva
lue
used
for
clus
teri
ng
naU
pdat
epr
otei
nda
taba
se;
com
pute
mis
sing
com
pari
sons
(old
byol
dan
dol
dby
new
);ap
ply
E-v
alue
cuto
ff;r
epea
tcl
uste
ring
from
new
Mar
kov
Mat
rix
No
curr
ent
publ
icac
cess
topr
otei
ncl
uste
rs
[ 46,
47]
No
curr
ent
publ
icac
cess
topr
otei
ncl
uste
rs
O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 413
Tabl
e1
(con
tinue
d)
Proj
ect
nam
ePr
otei
nse
quen
ceD
BPa
irw
ise
com
pari
son
Div
ersi
tyre
duct
ion
and
clus
teri
ngU
pdat
ing
Ref
eren
cean
din
tern
etac
cess
Num
ber
ofno
n-re
dund
ant
sequ
ence
s(fi
rstr
elea
seda
te)
Sour
ceC
ompu
ting
reso
urce
sA
min
oac
idsc
orin
gm
atri
x
Alig
nmen
tal
gori
thm
Alig
nmen
tsta
tistic
sO
utpu
ttab
leO
rgan
izat
ion
prin
cipl
es(p
ragm
atic
rule
vsth
eore
tical
spat
ial
proj
ectio
n)
Clu
ster
ing
met
hod
and
prod
uced
hier
arch
y
Clu
ster
repr
esen
tatio
nC
lust
erst
atis
tics
Pipe
line
Freq
uenc
y
Mod
elSt
atis
tical
para
met
erC
utof
f
Prot
oNet
94,1
53(J
uly
2002
)>
1,00
0,00
0(S
epte
mbe
r20
04)
Swis
sPro
t,la
tter
expa
nded
to TrE
MB
Lan
dev
entu
ally
to Uni
Prot
naB
LO
SUM
62B
LA
STP
Kar
lin–
Alts
chul
mod
elfo
rE
-val
ueco
mpu
tatio
n
E-v
alue
E-v
alue
≤10
E-v
alue
sPr
agm
atic
orga
niza
tion
ofda
ta/n
oex
plic
itsp
atia
lpr
ojec
tion.
Pseu
do-m
etri
csba
sed
onE
-val
ues.
Hie
rarc
hica
lde
tect
ion
ofcl
uste
rsfo
llow
ing
ano
n-su
perv
ised
proc
edur
e.D
egre
esof
gran
ular
ityar
ede
fined
bym
ergi
ngcl
uste
rsba
sed
onE
-val
ueav
erag
ede
fined
byth
ear
ithm
etic
,sq
uare
,ge
omet
ric
orha
rmon
icm
eans
com
pute
dw
ithE
-val
ues
take
nas
aki
ndof
met
rics
.
Clu
ster
sof
mul
tiple
gran
ular
ities
follo
win
gth
em
ergi
ngru
le.
Num
ber
and
size
ofcl
uste
rsva
ryw
ithth
em
etho
dof
mea
nco
mpu
tatio
n.D
efau
ltst
rate
gyis
unw
eigh
ted
pair
grou
pw
ithar
ithm
etic
mea
n(U
PGM
A),
gene
ratin
gU
PGM
tree
s(w
hich
are
not
phyl
ogen
etic
tree
s)
Non
-sup
ervi
sed
defin
ition
ofcl
uste
rhi
erar
chie
sba
sed
onau
tom
atic
cond
ensa
tion
rule
s
Upd
ate
prot
ein
data
base
;the
mas
sive
com
pari
son
upda
te,i
.e.
all
com
pari
sons
(old
+ne
wby
old
+
new
)or
only
mar
gina
l(ol
dby
new
and
new
byne
w),
isna
;the
upda
tepr
oced
ure
for
E-v
alue
sis
na;r
epea
tU
PGM
Acl
uste
ring
.
Las
tdat
abas
eup
date
:Se
ptem
ber
2004
.Upd
ate
freq
uenc
yis
curr
ently
once
ever
y6
mon
ths.
[ 48–
50]
(con
tinue
don
next
page
)
414 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427
Tabl
e1
(con
tinue
d)
Proj
ect
nam
ePr
otei
nse
quen
ceD
BPa
irw
ise
com
pari
son
Div
ersi
tyre
duct
ion
and
clus
teri
ngU
pdat
ing
Ref
eren
cean
din
tern
etac
cess
Num
ber
ofno
n-re
dund
ant
sequ
ence
s(fi
rstr
elea
seda
te)
Sour
ceC
ompu
ting
reso
urce
sA
min
oac
idsc
orin
gm
atri
x
Alig
nmen
tal
gori
thm
Alig
nmen
tsta
tistic
sO
utpu
ttab
leO
rgan
izat
ion
prin
cipl
es(p
ragm
atic
rule
vsth
eore
tical
spat
ial
proj
ectio
n)
Clu
ster
ing
met
hod
and
prod
uced
hier
arch
y
Clu
ster
repr
esen
tatio
nC
lust
erst
atis
tics
Pipe
line
Freq
uenc
y
Mod
elSt
atis
tical
para
met
erC
utof
f
Prot
oMap
365,
174
(May
2000
)
Swis
sPro
t+
TrE
MB
LM
OSI
XPa
ralle
lsy
stem
and
clus
ter
ofPC
s
BL
OSU
M50 +
BL
OSU
M62
BL
AST
P+
FAST
A+
SW
Exp
ecta
tion
valu
efo
ran al
ignm
ent
scor
eco
mpu
ted
with
the
real
dist
ribu
tion
ofal
lsc
ores
obta
ined
with
one
ofth
eal
igne
dse
quen
ces
agai
nsta
llth
epr
otei
nda
taba
se;
for
BL
AST
P,E
-val
ueis
obta
ined
from
the
Kar
lin–A
ltsch
ulm
odel
E-v
alue
E-v
alue
≤1
Con
sens
usE
-val
ues
obta
ined
afte
rnu
mer
ical
norm
aliz
atio
nof
E-v
alue
sob
tain
edw
ithal
lal
ignm
ent
algo
rith
ms
Prag
mat
icor
gani
zatio
nof
data
/no
expl
icit
spat
ial
proj
ectio
n.Pr
agm
atic
rule
:de
tect
ion
ofgr
aphs
,in
whi
chth
ew
eigh
tof
aned
geco
nnec
ting
two
prot
eins
isth
eE
-val
ue(M
arko
vian
repr
esen
tatio
n).
Pseu
do-m
etri
csba
sed
onE
-val
ues:
cons
ensu
sE
-val
ues
are
take
nas
dist
ance
sfo
rtr
eere
pres
enta
tions
Hie
rarc
hica
lde
tect
ion
ofgr
aphs
star
ting
with
agr
anul
arity
dete
rmin
edby
mos
tst
ring
ent
E-v
alue
s(<
10−
100),
and
grow
ing
byde
tect
ing
rela
tions
with
E-v
alue
slo
wer
than
10−
95,10
−90
,et
c.10
0(=
1)
Gra
phs
with
E-v
alue
sas
edge
wei
ghts
and
E-v
alue
base
dde
ndro
gram
s(w
hich
are
not
phyl
ogen
etic
tree
s)
Stat
istic
alte
stap
plie
dto
dete
ctan
del
imin
ate
“pro
blem
atic
”an
d“p
ossi
bly
fals
e”co
nnec
tions
betw
een
unre
late
dpr
otei
ns.
Qua
lity
ofco
nnec
tions
isde
fined
asm
inus
the
log
ofth
ege
omet
ric
mea
nof
pair
wis
eE
-val
ues
betw
een
conn
ecte
dse
quen
ces
Upd
ate
prot
ein
data
base
;sa
veor
igin
alsc
ore
dist
ribu
tion
and
scor
esw
ithE
-val
ues
<1;
com
pute
mis
sing
com
pari
sons
(old
byol
dan
dol
dby
new
);in
crem
ent
scor
edi
stri
butio
n;co
mpu
teal
lE
-val
ues
with
the
new
scor
edi
stri
butio
n;ap
ply
stat
istic
alte
ston
E-v
alue
;de
fine
grap
hs;a
pply
stat
istic
alte
ston
grap
hs
Las
tdat
abas
eup
date
attim
eof
wri
ting:
May
2000
.
[ 51–
53]
O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 415
Tabl
e1
(con
tinue
d)
Proj
ect
nam
ePr
otei
nse
quen
ceD
BPa
irw
ise
com
pari
son
Div
ersi
tyre
duct
ion
and
clus
teri
ngU
pdat
ing
Ref
eren
cean
din
tern
etac
cess
Num
ber
ofno
n-re
dund
ant
sequ
ence
s(fi
rstr
elea
seda
te)
Sour
ceC
ompu
ting
reso
urce
sA
min
oac
idsc
orin
gm
atri
x
Alig
nmen
tal
gori
thm
Alig
nmen
tsta
tistic
sO
utpu
ttab
leO
rgan
izat
ion
prin
cipl
es(p
ragm
atic
rule
vsth
eore
tical
spat
ial
proj
ectio
n)
Clu
ster
ing
met
hod
and
prod
uced
hier
arch
y
Clu
ster
repr
esen
tatio
nC
lust
erst
atis
tics
Pipe
line
Freq
uenc
y
Mod
elSt
atis
tical
para
met
erC
utof
f
SIM
AP
>3,
500,
000
(Fro
m20
03to
Oct
ober
2005
)
Uni
Prot
+m
ips
nonr
edH
+PF
AM
+PD
B+
entr
ies
of com
plet
ege
nom
es
Ver
satil
epi
pelin
eru
nnin
gon
Sun
Gri
deng
ine
clus
ters
and
BO
INC
-bas
edgr
idsy
stem
s
BL
OSU
M50
FAST
A+
SWFA
STA
isus
edin
alo
wco
stpr
erun
tode
tect
best
hits
and
SWis
used
ina
seco
ndru
non
the
sele
cted
pair
sof
sequ
ence
sto
com
pute
mos
texa
ctal
ignm
ent
scor
es
E-v
alue
SW scor
e≥
80
Alig
nmen
tra
wpa
ram
eter
sin
clud
ing
SWsc
ore,
and
E-v
alue
s
Prag
mat
icor
gani
zatio
nof
data
/no
expl
icit
spat
ial
proj
ectio
n.C
onve
rsio
nof
E-v
alue
tabl
ein
toM
arko
vM
atri
x.D
etec
tion
ofbi
dire
ctio
nal
best
hits
.
Mar
kov-
rand
om-
field
clus
teri
ngw
ithA
mst
erda
mM
CL
algo
rith
m.
No
hier
arch
ypr
oduc
ed.
Clu
ster
sof
mul
tiple
gran
ular
ities
byal
teri
ngin
flatio
nva
lue
used
for
clus
teri
ng.
Clu
ster
sar
epr
ovid
edas
lists
ofen
trie
sth
atca
nbe
sort
edba
sed
onal
ignm
ent
para
met
ers
(e.g
.len
gth,
iden
tity,
scor
e,Z
-sco
re).
Lis
tis
used
toge
nera
tem
ultip
leal
ignm
ents
and
hidd
enM
arko
vm
odel
s.
naU
pdat
epr
otei
nda
taba
se;
com
pute
mis
sing
com
pari
sons
(old
byol
dan
dol
dby
new
);ap
ply
SWsc
ore
cuto
ff;r
epea
tcl
uste
ring
from
new
Mar
kov
Mat
rix
Las
tdat
abas
eup
date
attim
eof
wri
ting:
Oct
ober
2005
.Im
port
ant
upda
tefr
eque
ncy
iscu
rren
tly∼
bim
onth
lyw
ithpe
aks
ever
y6
mon
ths.
[ 54,
55]
(con
tinue
don
next
page
)
416 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427
Tabl
e1
(con
tinue
d)
Proj
ect
nam
ePr
otei
nse
quen
ceD
BPa
irw
ise
com
pari
son
Div
ersi
tyre
duct
ion
and
clus
teri
ngU
pdat
ing
Ref
eren
cean
din
tern
etac
cess
Num
ber
ofno
n-re
dund
ant
sequ
ence
s(fi
rstr
elea
seda
te)
Sour
ceC
ompu
ting
reso
urce
sA
min
oac
idsc
orin
gm
atri
x
Alig
nmen
tal
gori
thm
Alig
nmen
tsta
tistic
sO
utpu
ttab
leO
rgan
izat
ion
prin
cipl
es(p
ragm
atic
rule
vsth
eore
tical
spat
ial
proj
ectio
n)
Clu
ster
ing
met
hod
and
prod
uced
hier
arch
y
Clu
ster
repr
esen
tatio
nC
lust
erst
atis
tics
Pipe
line
Freq
uenc
y
Mod
elSt
atis
tical
para
met
erC
utof
f
SYST
ER
SV
41,
168,
498
(Nov
embe
r20
03)
Swis
sPro
t+
TrE
MB
L+
entr
ies
of11
com
plet
eeu
kary
otic
geno
mes
Para
cel
Gen
eMat
cher
Wor
ksta
tion
naSW
Exp
ecta
tion
valu
efo
ran al
ignm
ent
scor
eco
mpu
ted
with
the
real
dist
ribu
tion
ofal
lsc
ores
obta
ined
with
one
ofth
eal
igne
dse
quen
ces
agai
nsta
llth
epr
otei
nda
taba
se
E-v
alue
E-v
alue
≤0.
05E
-val
ues
Prag
mat
icor
gani
zatio
nof
data
/no
expl
icit
spat
ial
proj
ectio
n.Pr
agm
atic
rule
:de
tect
ion
ofgr
aphs
,in
whi
chth
ew
eigh
tof
aned
geco
nnec
ting
two
prot
eins
isth
eE
-val
ue(M
arko
vian
repr
esen
tatio
n).
Pseu
do-m
etri
csba
sed
onE
-val
ues:
E-v
alue
sar
eta
ken
asdi
stan
ces
for
tree
repr
esen
tatio
ns.
SYST
Em
atic
RE
-Sea
rchi
ngcl
uste
ring
:st
epw
ise
cons
truc
tion
ofsi
ngle
linka
gecl
uste
ring
tree
s;hi
erar
chy
base
don
inte
rnal
stru
ctur
eof
tree
s;co
nstr
uctio
nof su
perf
amily
E-v
alue
grap
hs(E
-val
ues
asdi
stan
ces)
;su
bclu
ster
dete
ctio
nat
wea
kco
nnec
tions
.
Lis
tsof
entr
ies
belo
ngin
gto
clus
ters
;po
st-c
ompu
tatio
nof
DIA
LIG
Nm
ultip
leal
ignm
ents
and
ofth
eco
rres
pond
ing
unw
eigh
ted
pair
grou
pw
ithar
ithm
etic
mea
n(U
PGM
A)
tree
s.In
prev
ious
rele
ase,
mul
tiple
alig
nmen
tsw
ere
perf
orm
edus
ing
CL
UST
ALW
Clu
ster
cons
truc
tion
ises
timat
edse
lf-v
alid
atin
g
Upd
ate
prot
ein
data
base
;the
mas
sive
com
pari
son
upda
te,i
.e.
all
com
pari
sons
(old
+ne
wby
old
+
new
)or
only
mar
gina
l(ol
dby
new
and
new
byne
w),
isna
;the
upda
tepr
oced
ure
for
E-v
alue
sis
na;r
epea
tSY
STE
mat
icR
E-S
earc
hing
clus
teri
ng.
Las
tdat
abas
eup
date
attim
eof
wri
ting:
Nov
embe
r20
03.F
our
rele
ases
sinc
e19
99,
with
maj
orm
etho
dm
odifi
catio
ns(f
rom
BL
AST
Pto
SW).
Publ
icac
cess
esto
rele
ases
3an
d4.
[ 56–
58]
na:n
otav
aila
ble
O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 417
Tabl
e2
Mas
sive
all-
by-a
llpr
otei
nse
quen
cepa
irw
ise
com
pari
sons
base
don
alig
nmen
tZ-v
alue
stat
istic
s
Proj
ect
nam
ePr
otei
nse
quen
ceD
BPa
irw
ise
com
pari
son
Div
ersi
tyre
duct
ion
and
clus
teri
ngU
pdat
ing
Ref
eren
cean
din
tern
etac
cess
Num
ber
ofno
n-re
dund
ant
sequ
ence
s(fi
rstr
elea
seda
te)
Sour
ceC
ompu
ting
reso
urce
sA
min
oac
idsc
orin
gm
atri
x
Alig
nmen
tal
gori
thm
Alig
nmen
tsta
tistic
sO
utpu
tta
ble
Org
aniz
atio
npr
inci
ples
(pra
gmat
icru
levs
theo
retic
alsp
atia
lpr
ojec
tion)
Clu
ster
ing
met
hod
and
prod
uced
hier
arch
y
Clu
ster
repr
esen
tatio
nC
lust
erst
atis
tics
Pipe
line
Freq
uenc
y
Mod
elSt
atis
tical
para
met
erC
utof
f
Dec
rypt
hon
559,
275
(Jan
uary
2002
)
Swis
sPro
t+
TrE
MB
L+ en
trie
sof
76co
mpl
ete
geno
mes
Gri
dco
mpu
ting
mod
elru
nnin
gw
ithun
used
com
putin
gpo
wer
of75
,000
PCs
(200
hea
ch),
netw
ork
arch
itect
ure
corr
espo
nds
toa
virt
ual
calc
ulat
ion
pow
er>
40Te
raflo
ps
BL
OSU
M62
SWZ
-sco
reco
mpu
tatio
nby
50sh
uffli
ngof
each
alig
ned
sequ
ence
(100
shuf
fling
per
com
pari
son)
Z-s
core
Z-s
core
≥5
Alig
nmen
tra
wpa
ram
eter
san
dZ
-sco
re
naB
asic
defin
ition
base
don
Z-s
core
thre
shol
ds.
No
hier
arch
ypr
oduc
ed.
Clu
ster
sar
epr
ovid
edas
lists
ofen
trie
sth
atca
nbe
sort
edba
sed
onal
ignm
ent
para
met
ers
(e.g
.len
gth,
iden
tity,
scor
e,Z
-sco
re)
nana
Las
tdat
abas
eup
date
attim
eof
wri
ting:
Janu
ary
2002
[ 59]
Tera
Prot
240,
000
(Jul
y20
02)
entr
ies
of67
com
plet
ege
nom
es
645
node
Com
paq
ES4
5cl
uste
r(1
6to
512
proc
esso
rsus
edde
pend
ing
onth
ejo
bs;
76.3
hcu
mul
ated
time,
equi
vale
ntTe
ra)
PAM
10SW
Z-s
core
com
puta
tion
by10
0sh
uffli
ngof
each
alig
ned
sequ
ence
(200
shuf
fling
per
com
pari
son)
and
cons
erva
tion
ofth
em
inim
umZ
-sco
reob
tain
ed
Z-s
core
SW scor
e≥
220
Alig
nmen
tra
wpa
ram
eter
san
dZ
-sco
re
naB
asic
defin
ition
base
don
Z-s
core
thre
shol
ds.
No
hier
arch
ypr
oduc
ed.
Clu
ster
sar
epr
ovid
edas
lists
ofen
trie
sth
atca
nbe
sort
edba
sed
onal
ignm
ent
para
met
ers
(e.g
.len
gth,
iden
tity,
scor
e,Z
-sco
re)
naU
pdat
epr
otei
nda
taba
se;
com
pute
mis
sing
com
pari
sons
(old
byol
dan
dol
dby
new
);ap
ply
scor
ecu
toff
Las
tdat
abas
eup
date
attim
eof
wri
ting:
July
2003
with
entr
ies
of22
addi
tiona
lge
nom
es
[ 60]
(con
tinue
don
next
page
)
418 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427
Table
2(c
ontin
ued)
Proj
ect
nam
ePr
otei
nse
quen
ceD
BPa
irw
ise
com
pari
son
Div
ersi
tyre
duct
ion
and
clus
teri
ngU
pdat
ing
Ref
eren
cean
din
tern
etac
cess
Num
ber
ofno
n-re
dund
ant
sequ
ence
s(fi
rstr
elea
seda
te)
Sour
ceC
ompu
ting
reso
urce
sA
min
oac
idsc
orin
gm
atri
x
Alig
nmen
tal
gori
thm
Alig
nmen
tsta
tistic
sO
utpu
tta
ble
Org
aniz
atio
npr
inci
ples
(pra
gmat
icru
levs
theo
retic
alsp
atia
lpr
ojec
tion)
Clu
ster
ing
met
hod
and
prod
uced
hier
arch
y
Clu
ster
repr
esen
tatio
nC
lust
erst
atis
tics
Pipe
line
Freq
uenc
y
Mod
elSt
atis
tical
para
met
erC
utof
f
Phyt
oPro
t14
,723
(Jan
uary
2001
)
All
entr
ies
ofpl
ants
in Swis
sPro
t+
TrE
MB
L
Mul
tipro
cess
orSu
nSp
arc
serv
er45
00
naSW
Z-s
core
com
puta
tion
byn
shuf
fling
(var
ying
from
30to
600
toke
epco
ntro
lto
the
vari
ance
ofZ
)of
each
alig
ned
sequ
ence
and
cons
erva
tion
ofth
em
inim
umZ
-sco
reob
tain
ed.
Z-s
core
Z-s
core
>14
Alig
nmen
tra
wpa
ram
eter
san
dZ
-sco
re
Prag
mat
icor
gani
zatio
nof
data
/no
expl
icit
spat
ial
proj
ectio
n.U
seof
Z-s
core
tabl
efo
rpy
ram
idal
clas
sific
atio
n.Po
st-a
naly
sis
ofse
quen
ces
tode
tect
poss
ibly
shar
edsu
b-do
mai
nsw
ithin
clus
ters
.
Clu
ster
sde
fined
usin
gZ
-sco
reth
resh
olds
.Py
ram
idal
clas
sific
atio
nof
each
clus
ter
base
don
Z-s
core
s.So
me
prot
eins
may
belo
ngsi
mul
tane
ousl
yto
two
clas
ses.
Pyra
mid
alcl
assi
ficat
ion
tree
s(w
hich
are
not
phyl
ogen
etic
tree
s)an
dlis
tof
prot
eins
with
PRO
DO
Mba
sed
sub-
dom
ain
repr
esen
tatio
n.
naU
pdat
epr
otei
nda
taba
se;
com
pute
mis
sing
com
pari
sons
(old
byol
dan
dol
dby
new
);ap
ply
Z-s
core
cuto
ff;r
epea
tcl
uste
rde
tect
ion
and
pyra
mid
alcl
assi
ficat
ion.
Las
tdat
abas
eup
date
attim
eof
wri
ting:
July
2002
[ 61–
63]
Clu
STr
462,
000
(May
2004
)
Initi
ally
mam
mal
ian
and
plan
tpr
otei
nsin Sw
issP
rot
+ TrE
MB
Lan
den
trie
sof
3co
mpl
ete
euka
ryot
epr
otei
ns,
latte
rex
pand
edto U
niPr
ot(i
nclu
ding
195
com
plet
ege
nom
es)
SAR
Asu
perc
ompu
ter,
1024
-CPU
syst
emco
nsis
ting
oftw
o51
2-C
PUSG
IO
rigi
n38
00(o
new
ith32
to25
6C
PUpa
rtiti
on,t
heot
her
with
asi
ngle
512
CPU
part
ition
,nu
mbe
rof
proc
esso
rsw
asad
just
edde
pend
ing
onjo
bs,t
otal
run:
2m
onth
s)
naSW
Z-s
core
com
puta
tion
by10
0sh
uffli
ngof
each
alig
ned
sequ
ence
(200
shuf
fling
per
com
pari
son)
and
cons
erva
tion
ofth
em
inim
umZ
-sco
reob
tain
ed
Z-s
core
naA
lignm
ent
raw
para
met
ers
and
Z-s
core
Prag
mat
icor
gani
zatio
nof
data
/no
expl
icit
spat
ial
proj
ectio
n.U
seof
Z-s
core
tabl
efo
rsi
ngle
-lin
kage
clus
teri
ng.
Lin
ksof
resu
ltsw
ithkn
owle
dge
base
s,th
roug
hth
eIn
tegr
8pl
atfo
rm.
Hie
rarc
hica
lcl
uste
rsde
fined
usin
gZ
-sco
re.
Clu
ster
sar
ede
term
ined
by sing
le-l
inka
gem
etho
d.H
iera
rchy
isbu
iltin
crem
enta
lly.
Bin
ary
fore
stre
pres
entin
gth
ehi
erar
chy
ofcl
uste
rs,i
nw
hich
each
pare
ntcl
uste
rha
son
lytw
ocl
uste
rs.
Prun
ing
ofcl
uste
rsfo
llow
ing
ase
tof
rule
s(e
.g.e
xclu
sion
ofa
clus
ter
whe
nits
mem
bers
form
≥90
%of
itspa
rent
set,
sing
leto
ns).
Obt
aine
dcl
uste
rsar
eco
llect
edin
Clu
STr
Slim
.
Upd
ate
prot
ein
data
base
;co
mpu
tem
issi
ngco
mpa
riso
ns(o
ldby
old
and
old
byne
w);
appl
yZ
-sco
recu
toff
;rep
eat
clus
ter
dete
ctio
nan
dhi
erar
chy
clas
sific
atio
n.
Las
tupd
ate
attim
eof
wri
ting:
May
2004
.M
argi
nal
upda
teof
data
base
with
Uni
Prot
entr
ies
isan
noun
ced
tobe
bi-w
eekl
yat
Clu
STr
inte
rnet
port
al.
[ 64–
67]
na:n
otav
aila
ble
O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 419
Fig. 1. Massive all-by-all protein sequence comparison and protein spacereduction workflow. From a protein sequence database, a massive comparisonis carried out using an amino acid scoring matrix, an alignment method and acorresponding statistical model. The output of the comparison is a table thatis used for the reduction of the initial protein diversity at the sequence level,following pragmatic rules. Obtained clusters can be viewed as lists, graphs, orclassification trees which are not phylogenetic trees. According to the statisticalmodel used to estimate the accuracy of alignment scores, the update of theinitial protein sequence database can imply a complete re-computation of themassive all-by-all pairwise alignment in order to update the clusters.
interacting with families of drug scaffolds. It can therefore be amilestone toward a structured definition of a referential proteinspace for phylogenomics and chemogenomics. We evaluatedthe possible use of the CSHP projection to map and navigateinto the protein space using current massive protein comparisonoutputs, and we propose guidelines for future generations ofmassive protein sequence comparison projects.
2. Virtual protein sequence databases
Most of the millions of virtual protein sequences deducedfrom DNA automatic sequencing, and the millions to come,
will not be experimentally confirmed, neither their functiondirectly analyzed. Electronic storage of genes is scatteredin specific databases (e.g. GDB for Human [10], TAIR forArabidopsis [11], PlasmoDB for Plasmodium [12], etc.) andgathered in three major generalist repositories (e.g. EMBL,at the European Bioinformatic Institute [13], GenBank atthe National Center for Biotechnology Information [14] andDDBJ at the DNA Data Bank of Japan [15]). Institutes incharge of these databases, aware of the risk of ending withan inextricable labyrinth with redundancy and inconsistencies,unified the storage of gene entries and exchange their datadaily [16]. Coexisting protein databases with differing coverageand priorities were recently unified. Briefly, the PIR–PSD [17]and Swiss–Prot databases [18] contain the best characterizedsequences, with detailed and manually curated annotationsof protein function; they represent therefore a reference thebiologists enrich at their own pace. Complementarily, TrEMBL(translation of EMBL nucleotide sequence database) consistsof computer-annotated entries derived from the translation ofall coding sequences in the nucleotide sequence databases,except for entries already referenced in Swiss–Prot [18]. Acentralized authoritative resource, UniProt, has been formed byuniting PIR–PSD, Swiss–Prot and TrEMBL databases [4]. Theproportion of protein sequences with high-quality functionalannotation in the complete virtual protein sequence set was∼13% in 2003, ∼10% in 2004 and lower than 8% in 2005.This proportion will fall below 1% in 2008. Given the cost andlow throughput of experimental protein characterization and ofmanual curation of sequence annotations, our exploration of>92% of the protein space (>99% in 3 years) relies on ourability to extrapolate the most accurate navigation maps in theprotein space.
3. Sequence alignment score measures, optimization meth-ods and statistics for a homology-based reduction of theprotein space diversity
Given a new protein sequence, the widely used methodto start with is a homology-based annotation transfer, i.e. thetransfer of portions of knowledge from related sequences storedin databases, on the basis of a suspected common evolutionaryorigin. When homologues are detected in distinct organisms,they are termed orthologues and when they are suspected toderive from gene duplication inside a given organism, theyare termed paralogues. Sequence homology is assessed usingdynamic programming techniques similar to those developedto calculate the edit distances between strings. In all cases,the underlying rationale for homology search at the sequencelevel relies on a fundamental postulate that can be stated as:“the closer in the evolution, the more alike and conversely,the more alike, probably the closer in the evolution”. Genomeannotations based on homology, performed by numerous teamswithout any standard workflow, ended with a mild resultsince ∼25% of the virtual protein sequences have no knownhomologues [5], a figure that reaches ∼60% in some organisms,such as Plasmodium falciparum [19]. Paucity of detectedhomologues may be either due to a real lack of homologues
420 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427
or a technical failure (false negatives). Selection of scoringmeasures, sequence alignment methods and statistics for amost accurate homology-based reduction of the protein spacediversity should therefore be carefully examined before oneachieves a massive comparison.
Firstly, alignments are obtained by maximizing (orminimizing) a quantity named “score”, determined at the levelof aligned amino acids. To that purpose, scoring matricesare used to weight and sum scores of compared aminoacids and find optimal alignments, computed with a dynamicprogramming procedure. Scoring matrices have been foundto be similarity matrices as well [20,21]. Many scoringmatrices are available [22–25] and evaluation studies led tothe conclusion that those based on a log-odds ratio, likeBLOSUM [23], outperformed the others [26]. BLOSUM wascomputed using blocks of aligned sequences with:
s(i, j) =1λ
log(
qi j
pi p j
)where i and j are aligned amino acids, s(i, j) the score,qi j the frequency of the observation: “i is aligned with j”,i.e. the target frequency, pi and p j the background frequenciesof i and j respectively, and λ a scaling factor. We furtherdemonstrated [25] that the score was also related to the mutualinformation in the sense of Hartley [8,9], between the twoconsidered amino acids:
s(i, j) =1λ
I (i; j)
where I is the mutual information between events, i.e. thereduction of the uncertainty of event i due to the knowledgeof j . Massive comparisons using scoring matrices thus definedallow therefore the extraction of mutual information sharedby sequences. The postulate for homology-based annotationtransfer is therefore supported by information theory [8,9]and is re-formulated accordingly as: “Given two homologousproteins, the closer in the evolution, the greater their mutualinformation at the sequence level and conversely, the greater themutual information at the sequence level, probably the closerin the evolution”. Thus selection of the amino acid scoringmatrix can influence the quality of the information one cantransfer, and therefore the accuracy of the homology-basedreduction.
Secondly, few algorithms exist to optimize the protein se-quence alignments, being either global (at the full-length se-quence level) or local. Global alignment algorithms [27] arenot accurate to assess homology of domains in modular pro-teins [21]; local alignments are better suited [28–30]. The SWdynamic algorithm [28] is considered exact, but its computa-tional cost is too high for traditional computing resources, evenfor small samples of sequences. Heuristic approaches were suc-cessful in speeding up alignment processes, the most popularbeing the low CPU demanding BLASTP [30] and FASTA [29]algorithms. BLASTP proved to be efficient from routine com-parisons of small samples of proteins directly handled by benchbiologists, to batch comparisons of large numbers of sequencesmonitored by bioinformaticians. Rank in computation speed,
i.e. SW < FASTA < BLASTP, is considered the reverse to thatof accuracy. The BLASTP, FASTA and SW methods were suc-cessfully parallelized and Grid-enabled (e.g. [31–33]).
Thirdly, confidence in alignment results is estimatedaccording to statistical criteria. It is important to notethat an alignment algorithm comes with a statistical modelimplemented in the code, particularly in the BLAST package,and it is therefore difficult to discuss the current view onsequence comparison methods and statistics independently.Two major statistical models are currently used to testalignment scores. The most common test is an estimate ofthe E-value (short for Expectation value), i.e. the number ofalignments one expects to find in the database by chance,with equivalent or better scores. It can be determined fromthe complete distribution of scores. The BLASTP associatedstatistics defined by Karlin and Altschul [34] are based onthe probability of an observed local alignment score accordingto an extreme value distribution. The number of high-scoringsequence matching regions is estimated above a threshold by aPoisson distribution and allows the computation of a P-value,that the score could have occurred by chance, related to theE-value by the formula:
E = log(1/1 − P).
The E-values can therefore be computed based on thescore distribution or on the Karlin–Altschul model (seeTable 1). Validity of the Karlin–Altschul E-value computationmodel (and subsequent improvements) requires two restrictiveconditions: first, individual residue distributions for the twosequences should not be ‘too dissimilar’ and second, sequencelengths ‘should grow at roughly equal rates’ [34]. Validityrestrictions listed here are fully acceptable when dealingwith protein sequences of average lengths and amino aciddistribution, and BLASTP is a good compromise when havingaccess to limited CPU power. However, compositionallybiased genomes such as that of Plasmodium falciparum falloutside of the validity domain for a BLASTP comparisonwith unbiased sequences [35,36]. Based on a BLASTPsemi-automatic annotation procedure, around 60% of thePlasmodium sequences did not have any apparent homologywith sequences from other genomes [19] although suchproteomic uniqueness appeared doubtful. In addition, thedependence of the P-value calculation on the data bank sizeimplies that results may fluctuate through updating, which isnot compatible with the construction of a stable reference.
An alternative method to assess the relevance of analignment was introduced by Lipman and Pearson [37]. Ituses the Monte Carlo techniques to investigate the significanceof a given score calculated from the alignment of two realsequences a and b. It can be used to sort results obtainedby any comparison methods including BLASTP, althoughthis has not yet been achieved at a massive scale. It iscurrently used to estimate the probabilities of SW comparisons.The method consists in computing alignments of shuffledsequences from a and b [38]; the variables corresponding tothe shuffled sequences are termed a∗ and b∗ respectively. Thesecomparisons allow the estimate of an empirical mean score (µ)
O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 421
and standard deviation (σ ) from the distribution of the randomvariable S(a∗, b∗). The Z-value is then defined as:
Z =s(a, b) − µ
σ.
For the computation of a Z-value between two sequences,only one of the two sequences is generally shuffled [39]. Forpractical reasons, the Z-value was reformulated according toa conservative principle, that is the minimum of Z(a, b∗) andZ(a∗, b) [40]. Alternatively, Z-value can be estimated by anaverage of Z(a, b∗) and Z(a∗, b) (see Table 2). Computationof Z(a, b) is known to be convergent and depends on theaccuracy of the estimation of µ and σ , and therefore on thenumber of shuffling, ranging from 100 to 1000 [35,39,41]. Theasymptotic law of Z-value was shown to be independent ofsequence length and amino acid distribution [40]. The estimateof the Z-value was additionally dependent on the shufflingmethod because the shuffling procedure respects the sequencecomposition but breaks down biological structures [40]. Wedemonstrated the TULIP theorem (theorem of the upper limitof a score probability, [35]) assessing that Z-values can be usedas a statistical test, a single-linkage clustering criterion and that1/Z-value2 was an upper limit to the probability of an alignmentscore whatever the actual probability law was. From the TULIPtheorem and corollaries [7,35], the comparison of a protein toa given reference sequence a, weighed by an alignment score,is characterized by a bounded probability that the alignment isobtained by chance.
In practice, a Z-value table can be analyzed using theTULIP theorem to detect pairs of proteins that are probablyhomologues following a Z-value confidence cutoff. Forinstance, a Z-value above 10 allows an estimate that thealignment is significant with a statistical risk of 1/Z-value2,i.e. 0.01.
The TULIP theorem provides therefore sustained ar-guments in favor of the Lipman–Pearson model to es-timate an alignment probability for massive comparisons;in particular the two restrictions of the Karlin–Altschulmodel, i.e. that the amino acid distributions of com-pared sequences should not be too dissimilar and thattheir lengths should be relatively close, are not required.In addition, Z-values are completely independent of thealignment length and are therefore normalized values, andZ-value statistics are independent of the databank size andtherefore stable after each database update. Eventually, up-date is made easy since the simple collection of {new}-by-{old} and {new}-by-{new} is sufficient to complete theZ-value table obtained from an existing all-by-all comparison(Figs. 2 and 3).
Tables 1 and 2 shows a comparative record of protein all-by-all massive comparisons that have been performed to organize,map, and reduce the diversity of the protein space. Whenavailable, information on the computing resource (mostly gridsand supercomputers) was provided. Table 1 describes projects(COG [42–45], Tribe [46,47], ProtNet [48–50], ProtoMap [51–53], SIMAP [54,55], SYSTERS release 4 [56–58]) in whichalignment significance was assessed from E-value estimates
that are less CPU-demanding than Z-value computations ina single massive comparison experiment. The handling ofthe output n × n table of E-values requires pragmatic post-processing normalization, including asymmetric correctionsof E-values obtained after permutation of the two alignedsequences (e.g. Tribe), or consensus E-value computationafter alignment with different algorithms (e.g. ProtoNet). Inall cases, there is no theoretical support to justify that anE-value table can be converted into a rigorous and stablemetric. The E-value table can be converted into a Markovmatrix (e.g. Tribe, SIMAP), or a close graphic equivalent,i.e. graphs connecting protein entries with E-values as weightsfor graph edges (e.g. COG, ProtoMap, SYSTERS). The proteinsets are organized either by detecting graphs and sub-graphsfollowing pragmatic rules, with granularities depending onE-value thresholds, or by distance clustering using E-value asa pseudo-metrics, or by Markov-random-field clustering. Noneof the obtained organization of the protein sequences can benamed a spatial projection and none of the obtained clusterscan be represented as a phylogenetic reconstruction. Eventually,the economy of computing E-values in an all-by-all comparisonexperiment is not gained in the updating process that requires acomplex pipeline or a complete re-calculation.
Table 2 describes projects (Decrypthon [59], TeraProt [60],PhytoProt [61–63], CluSTr [64–67]) in which alignment sig-nificance was assessed from Z-value estimates. All compu-tations were undergone with SW. It is possible to computescores and Z-values with other alignment algorithms such asBLASTP, but it is very likely that this was not done sim-ply because E-value calculation is directly implemented inthe BLASTP code, and because Z-value computation withBLASTP is not yet coded. Handling of the output n × ntable does not require in-depth processing and is compli-ant with a straightforward updating process, i.e. {new}-by-{new} and {new}-by-{old} computations. Because Z-valuesare sufficient to assess an alignment significance [7,35], re-sults are provided either as raw lists one can sort followingZ-values (Decrypthon, TeraProt), or as clusters using Z-valuesas single linkage criterion (ClusTr) or more sophisticatedclusters obtained following pyramidal classification (Phyto-Prot). The ClusTr comparison provided a hierarchical proteinspace map for Integr8, the integrated functional and moleculardatabase/knowledgebase of the EBI [65–67]. Although Z-valuecomputation has a high initial CPU-cost, the statistic validityand easy update of the Z-value table are arguments support-ing that future massive comparisons using grid and supercom-puter power at best should get inspired by this first generationof projects listed in Table 2.
From this overview of alignment score measures, optimiza-tion methods and statistics, the methods of choice for a reduc-tion of the virtual proteome from an all-by-all sequence com-parison seem to include the use of log odds ratio-derived aminoacid scoring matrices (such as BLOSUM), the exact and CPUdemanding SW algorithm, and statistics based on Z-values.This workflow is summarized in Figs. 2 and 3.
422 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427
Fig. 2. Massive all-by-all protein sequence comparison and protein space reduction workflow, based on Z-value statistics and protein projection in the CSHP. Froma protein sequence database, a massive comparison is carried out using an amino acid scoring matrix such as BLOSUM, the SW alignment method and the TULIPstatistical model based on Z-values (requiring Monte Carlo simulation). The output of the comparison is a table containing SW scores and Z-values that are used forthe reduction of the initial protein diversity at the sequence level, following the rigorous CSHP spatial projection. Obtained clusters can be viewed as phylogeneticTULIP trees.
4. The configuration space of homologue proteins (CSHP)and its phylogenetic topology, based on pair-wise Z-valueprobabilities
We introduced a representation of protein sequences usingpair-wise alignments in which Z-values are computed andstored [7]. In a set of n homologous proteins, any sequence acan be selected as a reference in respect to which the n − 1others are compared. Such a geometric representation ofobjects relative to a fixed frame is known as a configurationspace (CS), and named accordingly the configuration spaceof homologous proteins or CSHP. In a set of n homologousproteins, it is therefore possible to define n referencesfor the CSHP by permuting the sequences considered asreferential. An interesting property of the CSHP derives fromthe demonstration that sequence similarity computed for an
alignment is an expression of the mutual information sharedby protein sequences, as stated in the information theory [8,9]. The CSHP is conservative for mutual information, in theway physical spaces can be conservative for metrics, forcesor energy. Mutual information with a referential sequence a issufficient for the full positioning of the n − 1 homologues, andthe positioning of proteins that share some homology with thereference is unambiguous, unique, and unaltered when proteinsare added or removed.
In the CSHP, features separating objects, i.e. the sequences,can be used to compute a probabilistic expression ofproximity based on Z-values [7], out of which one candeduce a divergence time t , following assumptions derivedfrom an evolutionary model [68]. For instance, given aand b two homologous sequences, the simple molecularclock hypothesis [69] supposes that t is a measure of the
O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 423
Fig. 3. Updating process for a massive all-by-all protein sequence comparison based on Z-value statistics and protein projection in the CSHP. According to thestatistical model used to estimate the accuracy of alignment scores, the update of the initial protein sequence database {new} implies that a simple {new}-by-{new}
and {new}-by-{old} computation is sufficient in order to update the clusters.
transmutation of a and b as a consequence of a Poissonprocess. In the CSHP, an evolutionary distance is given by thedivergence observed between two sequences knowing that theyshare some features (the observed sequences a and b) and thatthey were identical before the speciation event (an unknownancestral sequence u). Considering b∗a shuffled sequence ofb, then the probability pid/a that b∗ shares identity with a,knowing that the proximity between b∗ and a is lower than thatbetween b and a, was shown to be bounded above according tothe following formula:
pid/a(b∗) ≤
(s(a,b)−µ1
σ1
)2
(s(a,a)−µ2
σ2
)2
where s(a, a) and s(a, b) are the pair-wise optimal alignmentscores of a with a and a with b, and µ1, and σ1, µ2 and σ2 arethe mean and the standard deviation of S(a, b∗) and S(a, a∗)
respectively. Using the Poisson correction, an expression oft (a, b) is given as the linear combination of the two correctionsof the evolutionary distances deduced from pid/a and pid/b:
t (a, b) = −[log(pid/a(b∗)) + log(pid/b(a∗))]
with a∗ and b∗ the random variables corresponding to theshuffled sequences of a and b respectively. The sum of thelogarithms corresponds to the product of the two probabilities,an expression of the hypothesis of independence of lineage.Thus, t (a, b) appears as a function of Z-value ratios.
For any set of homologous proteins, it is therefore possibleto measure a table of pair-wise divergence times and build
424 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427
Fig. 4. Exploration of the protein space based on the TeraProt sequence massive comparison and protein projection in the CSHP. Given a UniProt protein entry(here, Q9SM44) we extracted sequences from the TeraProt output table that could be aligned with the query with a Z-value > 30. This set contains plant proteins(Q9FZL4, O81770, Q9S193, Q9FZL3, P93115) that have the same function as the query (MGD, i.e. catalyzing the transfer of a galactose from a UDP-galactosedonor onto a lipidic diacylglycerol acceptor) and bacterial proteins (Q8XMH5, Q99V75, Q8NXC3, YPFP BACSU, Q8XIB5, Q97F56, Q9KBH0) having a distinctbut related activity (MURG, i.e. catalyzing the transfer of an N-acetyl-glucosamine from a UDP-N-acetyl-glucosamine donor onto the Lipid 1 acceptor). (a) TULIPtree computed from the Z-value table obtained from the TeraProt project (Monte Carlo simulation with 100 sequence shuffling; amino acid scoring matrix: PAM10). This tree highlights the clear evolutionary divergence between plant and bacterial sequences. (b) TULIP tree obtained after a re-computation of the Z-valuetable with 100 sequence shuffling and the BLOSUM 62 amino acid scoring matrix. (c) TULIP tree obtained after a re-computation of the Z-value table with 2000sequence shuffling and the BLOSUM 62 amino acid scoring matrix. The TULIP tree constructed from the TeraProt database was obtained after <15 s using abench workstation (HP ProLiant G4 Bi-Xeon—2.8 GHz) whereas trees reconstructed with 100 shuffling required 6 min computation and reconstruction with 2000shuffling required less than 2 h. This example illustrates the power of the method and the benefit one can get by exploring massive comparison results using theCSHP projection and TULIP phylogenetic underlying topology.
phylogenetic trees using distance methods (see Fig. 2). Thesetrees were called TULIP trees. TULIP trees were comparedto phylogenetic trees built using conventional methods, forinstance the popular PHYLIP [70] or PUZZLE [71] methodsbased on multiple sequence alignments. TULIP trees provedto perform as well in any unbiased sets of proteins. Moreover,some phylogenetic inconsistencies in trees built with multiple-alignment based methods, particularly including subsets ofcompositionally biased sequences, or with low bootstrappingvalues, could be spectacularly solved with the TULIP tree[7,72].
An advantage of the phylogenetic inference from the CSHPover that obtained from multiple alignments lies precisely in theTULIP tree construction from pair-wise alignments. Whereasthe addition or removal of a sequence can deeply alter themultiple alignment result, and the deduced phylogeny, theZ-value and divergence time tables that serve to reconstruct thephylogenetic trees from the CSHP are the result of a MonteCarlo simulation, which is a converging process at the level ofthe pair-wise comparison and is not altered by database updates.As a result, whereas a phylogenetic database computed frommultiple alignments would require a complete and increasingcomputation for any update, the TULIP tree calculation simplyrequires the calculation of the {new}-by-{old} and {new}-by-{new} Z-values and divergence times (Fig. 3). The CSHP andits assigned Darwinian phylogenetic topology is therefore atheoretical frame of choice for the treatment of results ofmassive pairwise protein sequence comparisons.
5. CSHP projection and TULIP tree construction fromexisting massive comparison tables
Can the CSHP be used to obtain a spatial projection ofproteins and detect clusters using existing massive comparisontables? Can the Darwinian topology assigned to the CSHP beused to return phylogenetic trees (TULIP trees) from theseCSHP projections? We downloaded complete results of oneof the projects listed in Table 2, i.e. the complete release ofTeraProt, which output was made publicly available as lists,one can sort using alignment parameters including Z-values.Thus, given a protein entry, we can extract the complete listof sequences which alignment Z-value was above a definedcutoff. The subsequent projection of this set of sequences inthe CSHP requires the complete table of pairwise Z-values.Because of SW score cutoff used in the TeraProt workflow(see Table 2), these required Z-values may not have been allcollected, and the table may contain holes. A default Z-valueshould therefore be introduced to circumvent this missing data,or the corresponding pairs of sequences should be discarded asthe missing alignment may reflect a non-transitive homology.
Fig. 4a shows an example of a TULIP tree obtainedwith sequences that were aligned with a spinach protein(MGD1, UniProt reference Q9SM44) with a Z-value above 30,computed after 100 sequence shuffling. This set contains plantproteins that have the same function (MGD, i.e. catalyzing thetransfer of a galactose from a UDP-galactose donor onto alipidic diacylglycerol acceptor) and bacterial proteins having
O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 425
a distinct but related activity (MURG, i.e. catalyzing thetransfer of an N-acetyl-glucosamine from a UDP-N-acetyl-glucosamine donor onto the Lipid 1 acceptor). The obtainedcluster is consistent with the common origin of MGD andMURG sequences and their 3D structural similarity [72].The TULIP tree highlights the clear evolutionary divergencebetween these groups of sequences. Fig. 4b and c show theTULIP trees obtained after a re-computation of the Z-valuetable with 100 and 2000 sequence randomizations. Althoughthe accuracy of the tree is improved, the reliability of theinitial clusters and derived phylogeny shown in Fig. 4a, asestimated from the refined tree shown in Fig. 4b, illustrates thatthe information collected in the original massive comparisoncould be extracted with a good level of confidence. The TULIPtree constructed from the TeraProt database was obtained after<15 s using a bench workstation (HP ProLiant G4 Bi-Xeon—2.8 GHz) whereas trees reconstructed with 100 shufflingrequired 6 min computation and reconstruction with 2000shuffling required less than 2 h using the bench workstation.This example illustrates the power of the method and thebenefit one can get by exploring massive comparison resultsusing the CSHP projection and TULIP phylogenetic underlyingtopology.
Because previous massive comparisons based on Z-valuestatistics, such as TeraProt, were not specifically designedto be processed following a CSHP projection and TULIPphylogenetic reconstruction, we noticed that the outputZ-value table could not be fully subjected to this post-analyticaltreatment. On the one hand, multiple occurrences of someidentical sequences introduced redundant Z-values had to befiltered and, on the other hand, missing Z-values had to becorrected by default values. We are currently designing aninternet-portal allowing the exploration of protein space(s)based on curated massive comparison tables.
6. Conclusion
The proportion of protein sequences in public databases thatwill be of high-quality annotation will fall below 1% in less than3 years, and the total number of proteins in the virtual proteomewill soon exceed 10 million. The need for a standard theoreticaland practical projection of the protein space is urgent. Thereduction of the virtual proteome from an all-by-all sequencecomparison seem to include the use of odds ratio-derived aminoacid scoring matrices (such as BLOSUM), the exact and CPUdemanding SW algorithm, and statistics based on Z-values.One of our future goals is to compare the performance andaccuracy of BLASTP and SW based on Z-value statistics,which might be determinant in selecting the alignment methodfor future massive comparison projects. The use of Z-valuesis extremely important to embrace the diversity of the proteinworld, and particularly proteins that diverge from the averageamino acid distribution (compositionally biased [35,36]). TheCSHP is obtained by a rigorous projection of proteins,respecting biological constraints and being conservative formutual information shared by sequences [7]. It can be assigneda topology based on a protein evolution model that allows a
construction of phylogenetic trees, named TULIP trees [7].TULIP trees have numerous advantages over current clusteringmethods, including their stability and explicit Darwinianconstruction that makes sense to a bench biologist. The criticalneed of evolutionary consistent protein sequence navigationmaps and the lessons one can learn from the analysis of theprevious massive comparisons, presented in this paper, led usto investigate how the CSHP projection could help in providinga theoretically robust and practical solution. In this paper,we showed that the CSHP projection can be used to processthe outputs of current massive comparison projects (TeraProtresults can be explored at the TULIP 1.1 server that willsoon be launched, but the CSHP projection and TULIP treecluster representation can also be an important improvementfor other Z-value based systematic massive comparisonssuch as ClusTr) given minor corrections for uncollected lowZ-values in the database. The use of massive comparisonoutputs are not restricted to phylogenic explorations, and canhelp automatic massive annotations by homology transfers,which require other clustering methods, including speciesspecific occurrence profiles. The CSHP projection is thereforeproposed as a sequence-based underlying organization ofsequences that should be linked to other protein classification(particularly based on sub-domains and available tertiarystructures) and knowledge bases.
References
[1] M. Margulies, M. Egholm, W.E. Altman, S. Attiya, J.S. Bader, L.A.Bemben, J. Berka, M.S. Braverman, Y.J. Chen, Z. Chen, S.B. Dewell,L. Du, J.M. Fierro, X.V. Gomes, B.C. Godwin, W. He, S. Helgesen, C.H.Ho, G.P. Irzyk, S.C. Jando, M.L. Alenquer, T.P. Jarvie, K.B. Jirage, J.B.Kim, J.R. Knight, J.R. Lanza, J.H. Leamon, S.M. Lefkowitz, M. Lei, J. Li,K.L. Lohman, H. Lu, V.B. Makhijani, K.E. McDade, M.P. McKenna, E.W.Myers, E. Nickerson, J.R. Nobile, R. Plant, B.P. Puc, M.T. Ronan, G.T.Roth, G.J. Sarkis, J.F. Simons, J.W. Simpson, M. Srinivasan, K.R. Tartaro,A. Tomasz, K.A. Vogt, G.A. Volkmer, S.H. Wang, Y. Wang, M.P. Weiner,P. Yu, R.F. Begley, J.M. Rothberg, Genome sequencing in microfabricatedhigh-density picolitre reactors, Nature 437 (7057) (2005) 376–380.
[2] An example of an in silico docking can be accessed at http://wisdom.eu-egee.fr/.
[3] H.J. Feldman, M. Dumontier, S. Ling, N. Haider, C.W. Hogue, CO: Achemical ontology for identification of functional groups and semanticcomparison of small molecules, FEBS Lett. 579 (21) (2005) 4685–4691.
[4] A. Bairoch, R. Apweiler, C.H. Wu, W.C. Barker, B. Boeckmann, S. Ferro,E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M.J. Martin, D.A. Natale,C. O’Donovan, N. Redaschi, L.S. Yeh, The universal protein resource(UniProt), Nucleic Acids Res. 33 (Database issue) (2005) D154–D159.
[5] Y. Ofran, M. Punta, R. Schneider, B. Rost, Beyond annotation transferby homology: Novel protein-function prediction methods to assist drugdiscovery, Drug Discov. Today 10 (21) (2005) 1475–1482.
[6] J. Liu, B. Rost, Domains, motifs and clusters in the protein universe, Curr.Opin. Chem. Biol. 7 (1) (2003) 5–11.
[7] O. Bastien, P. Ortet, S. Roy, E. Marechal, A configuration space ofhomologous proteins conserving mutual information and allowing aphylogeny inference based on pair-wise Z-score probabilities, BMCBioinformatics 6 (1) (2005) 49.
[8] R.V.L. Hartley, Transmission of information, The Bell Syst. Tec. J. 3(1928) 535–564.
[9] C.E. Shannon, A mathematical theory of communication, The Bell Syst.Tec. J. 27 (1948) 379–423.
[10] Access to the genome database for Human at http://gdbwww.gdb.org/.
426 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427
[11] Access to the genome database for Arabidopsis athttp://www.arabidopsis.org/.
[12] Access to the genome database for Plasmodium athttp://www.plasmodb.org/.
[13] EMBL genome sequence repository at the European BioinformaticInstitute web service: http://www.ebi.ac.uk/embl/.
[14] GenBank genome sequence repository at the National Center for Biotech-nology Information web service: http://www.ncbi.nlm.nih.gov/entrez/.
[15] DDBJ genome sequence repository at the DNA Data Bank of Japan:http://www.ddbj.nig.ac.jp/.
[16] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, D.L. Wheeler,GenBank, Nucleic Acids Res. 33 (Database issue) (2005) D34–D38.
[17] C.H. Wu, L.S. Yeh, H. Huang, L. Arminski, J. Castro-Alvear, Y. Chen,Z. Hu, P. Kourtesis, R.S. Ledley, B.E. Suzek, C.R. Vinayaka, J. Zhang,W.C. Barker, The protein information resource, Nucleic Acids Res. 31 (1)(2003) 345–347.
[18] B. Boeckmann, A. Bairoch, R. Apweiler, M.C. Blatter, A. Estreicher,E. Gasteiger, M.J. Martin, K. Michoud, C. O’Donovan, I. Phan, S.Pilbout, M. Schneider, The SWISS-PROT protein knowledgebase and itssupplement TrEMBL in 2003, Nucleic Acids Res. 31 (1) (2003) 365–370.
[19] M.J. Gardner, N. Hall, E. Fung, O. White, M. Berriman, R.W. Hyman,J.M. Carlton, A. Pain, K.E. Nelson, S. Bowman, I.T. Paulsen, K. James,J.A. Eisen, K. Rutherford, S.L. Salzberg, A. Craig, S. Kyes, M.S. Chan,V. Nene, S.J. Shallom, B. Suh, J. Peterson, S. Angiuoli, M. Pertea,J. Allen, J. Selengut, D. Haft, M.W. Mather, A.B. Vaidya, D.M. Martin,A.H. Fairlamb, M.J. Fraunholz, D.S. Roos, S.A. Ralph, G.I. McFadden,L.M. Cummings, G.M. Subramanian, C. Mungall, J.C. Venter, D.J.Carucci, S.L. Hoffman, C. Newbold, R.W. Davis, C.M. Fraser, B. Barrell,Genome sequence of the human malaria parasite plasmodium falciparum,Nature 419 (6906) (2002) 498–511.
[20] M.S. Waterman, Introduction to Computational Biology: Maps, Se-quences, and Genomes, CRC Press, 1995.
[21] J. Setubal, J. Meidanis, Introduction to Computational Molecular Biology,PWS Publishing Company, Boston, 1997.
[22] M.O. Dayhoff, W.C. Barker, L.T. Hunt, Establishing homologies inprotein sequences, Method Enzymol. 91 (1983) 524–545.
[23] S. Henikoff, J.G. Henikoff, Amino acid substitution matrices from proteinblocks, Proc. Natl Acad. Sci. USA 89 (22) (1992) 10915–10919.
[24] J.L. Risler, M.O. Delorme, H. Delacroix, A. Henaut, Amino acidsubstitutions in structurally related proteins. A pattern recognitionapproach. Determination of a new and efficient scoring matrix, J. Mol.Biol. 204 (1988) 1019–1029.
[25] O. Bastien, S. Roy, E. Marechal, Construction of non-symmetricsubstitution matrices derived from proteomes with biased amino aciddistributions, C. R. Biol. 328 (5) (2005) 445–453.
[26] S. Henikoff, J.G. Henikoff, Performance evaluation of amino acidsubstitution matrices, Proteins 17 (1) (1993) 49–61.
[27] S.B. Needleman, C.D. Wunsch, A general method applicable to the searchfor similarities in the amino acid sequence of two proteins, J. Mol. Biol.48 (3) (1970) 443–453.
[28] T.F. Smith, M.S. Waterman, Identification of common molecularsubsequences, J. Mol. Biol. 147 (1) (1981) 195–197.
[29] W.R. Pearson, D.J. Lipman, Improved tools for biological sequencecomparison, Proc. Natl Acad. Sci. USA 85 (8) (1988) 2444–2448.
[30] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic localalignment search tool, J. Mol. Biol. 215 (3) (1990) 403–410.
[31] E. Glemet, J.J. Codani, LASSAP, a LArge Scale Sequence compArisonPackage, Comput. Appl. Biosci. 13 (2) (1997) 137–143.
[32] T. Rognes, E. Seeberg, Six-fold speed-up of Smith–Waterman sequencedatabase searches using parallel processing on common microprocessors,Bioinformatics 16 (8) (2000) 699–706.
[33] A. YarKhan, J. Dongarra, Biological sequence alignment on thecomputational grid using the GrADS framework, Future Gener. Comput.Syst. 21 (6) (2005) 980–986.
[34] S. Karlin, S.F. Altschul, Methods for assessing the statistical significanceof molecular sequence features by using general scoring schemes, Proc.Natl Acad. Sci. USA 87 (6) (1990) 2264–2268.
[35] O. Bastien, J.C. Aude, S. Roy, E. Marechal, Fundamentals ofmassive automatic pairwise alignments of protein sequences: Theoreticalsignificance of Z-value statistics, Bioinformatics 20 (4) (2004) 534–537.
[36] O. Bastien, S. Lespinats, S. Roy, K. Metayer, B. Fertil, J.J. Codani,E. Marechal, Analysis of the compositional biases in plasmodiumfalciparum genome and proteome using Arabidopsis thaliana as areference, Gene 336 (2) (2004) 163–173.
[37] D.J. Lipman, W.R. Pearson, Rapid and sensitive protein similaritysearches, Science 227 (4693) (1985) 1435–1441.
[38] W.M. Fitch, Random sequences, J. Mol. Biol. 163 (2) (1983) 171–176.[39] J.P. Comet, J.C. Aude, E. Glemet, J.L. Risler, A. Henaut, P.P. Slonimski,
J.J. Codani, Significance of Z-value statistics of Smith–Waterman scoresfor protein alignments, Comput Chem. 23 (3–4) (1999) 317–331.
[40] J.N. Bacro, J.P. Comet, Sequence alignment: An approximation law forthe Z-value with applications to databank scanning, Comput Chem. 25 (4)(2001) 401–410.
[41] J.C. Aude, A. Louis, An incremental algorithm for Z-value computations,Comput Chem. 26 (5) (2002) 403–411.
[42] R.L. Tatusov, E.V. Koonin, D.J. Lipman, A genomic perspective onprotein families, Science 278 (5338) (1997) 631–637.
[43] R.L. Tatusov, D.A. Natale, I.V. Garkavtsev, T.A. Tatusova, U.T.Shankavaram, B.S. Rao, B. Kiryutin, M.Y. Galperin, N.D. Fedorova,E.V. Koonin, The COG database: New developments in phylogeneticclassification of proteins from complete genomes, Nucleic Acids Res. 29(1) (2001) 22–28.
[44] R.L. Tatusov, N.D. Fedorova, J.D. Jackson, A.R. Jacobs, B. Kiryutin, E.V.Koonin, D.M. Krylov, R. Mazumder, S.L. Mekhedov, A.N. Nikolskaya,B.S. Rao, S. Smirnov, A.V. Sverdlov, S. Vasudevan, Y.I. Wolf, J.J. Yin,D.A. Natale, The COG database: An updated version includes eukaryotes,BMC Bioinformatics 4 (2003) 41.
[45] Access to COG protein clusters at http://www.ncbi.nlm.nih.gov/COG/.[46] A.J. Enright, V. Kunin, C.A. Ouzounis, Protein families and TRIBES in
genome sequence space, Nucleic Acids Res. 31 (15) (2003) 4632–4638.[47] TribeMCL downloadable at http://www.ebi.ac.uk/research/cgg/tribe/.[48] O. Sasson, A. Vaaknin, H. Fleischer, E. Portugaly, Y. Bilu, N. Linial,
M. Linial, ProtoNet: Hierarchical classification of the protein space,Nucleic Acids Res. 31 (1) (2003) 348–352.
[49] N. Kaplan, O. Sasson, U. Inbar, M. Friedlich, M. Fromer, H. Fleischer,E. Portugaly, N. Linial, M. Linial, ProtoNet 4.0: A hierarchicalclassification of one million protein sequences, Nucleic Acids Res. 33(Database issue) (2005) D216–D218.
[50] Access to ProtoNet massive comparison output athttp://www.protonet.cs.huji.ac.il/.
[51] G. Yona, N. Linial, M. Linial, ProtoMap: Automatic classification ofprotein sequences and hierarchy of protein families, Nucleic Acids Res.28 (1) (2000) 49–55.
[52] G. Yona, N. Linial, M. Linial, ProtoMap: Automatic classification ofprotein sequences, a hierarchy of protein families, and local maps of theprotein space, Proteins 37 (3) (1999) 360–378.
[53] Access to ProtoMap protein clusters at http://protomap.cornell.edu/.[54] R. Arnold, T. Rattei, P. Tischler, M.D. Truong, V. Stumpflen, W. Mewes,
SIMAP—The similarity matrix of proteins, Bioinformatics 21 (Suppl. 2)(2005) ii42–ii46.
[55] Access to SIMAP massive comparison out-put at http://mips.gsf.de/services/analysis/simap/ andhttp://webclu.bio.wzw.tum.de/cgi-bin/simap/start.pl.
[56] A. Krause, J. Stoye, M. Vingron, Large scale hierarchical clustering ofprotein sequences, BMC Bioinformatics 6 (1) (2005) 15.
[57] A. Krause, M. Vingron, A set-theoretic approach to database searchingand clustering, Bioinformatics 14 (5) (1998) 430–438.
[58] Access to SYSTERS V4 protein clusters at http://systers.molgen.mpg.de/.[59] Access to Decrypthon massive comparison output at
http://www.infobiogen.fr/services/decrypthon/decrypthon gb.htm.[60] Access to TeraProt massive comparison output at
http://www.infobiogen.fr/services/Teraprot/.[61] J.C. Aude, Y. Diaz-Lazcoz, J.J. Codani, J.L. Risler, Applications of the
pyramidal clustering method to biological objects, Comput Chem. 23(3–4) (1999) 303–315.
O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 427
[62] A. Louis, E. Ollivier, J.C. Aude, J.L. Risler, Massive sequencecomparisons as a help in annotating genomic sequences, Genome Res.11 (1) (2001) 1296–1303.
[63] Access to PhytoProt protein clusters at http://genoplante-info.infobiogen.fr/phytoprot/.
[64] R. Apweiler, M. Biswas, W. Fleischmann, A. Kanapin, Y. Karavi-dopoulou, P. Kersey, E.V. Kriventseva, V. Mittard, N. Mulder, I. Phan,E. Zdobnov, Proteome analysis database: Online application of InterProand CluSTr for the functional classification of proteins in whole genomes,Nucleic Acids Res. 29 (1) (2001) 44–48.
[65] E.V. Kriventseva, F. Servant, R. Apweiler, Improvements to CluSTr: Thedatabase of SWISS-PROT+TrEMBL protein clusters, Nucleic Acids Res.31 (1) (2003) 388–389.
[66] R. Petryszak, E. Kretschmann, D. Wieser, R. Apweiler, The predictivepower of the CluSTr database, Bioinformatics 21 (18) (2005)3604–3609.
[67] Access to CluSTr protein clusters at http://www.ebi.ac.uk/clustr/.[68] L. Brocchieri, Phylogenetic inferences from molecular sequences: Review
and critique, Theor. Popul. Biol. 59 (1) (2001) 27–40.[69] E. Zuckerkandl, The evolution of hemoglobin, Sci. Am. 212 (1965)
110–118.[70] J. Felsenstein, PHYLIP—Phylogeny inference package (Version 3.2),
Cladistics 5 (1989) 164–166.[71] H.A. Schmidt, K. Strimmer, M. Vingron, A. von Haeseler, TREE-
PUZZLE: Maximum likelihood phylogenetic analysis usingquartets and parallel computing, Bioinformatics 18 (3) (2002)502–504.
[72] C. Botte, C. Jeanneau, L. Snajdrova, O. Bastien, A. Imberty, C. Breton,E. Marechal, Molecular modeling and site-directed mutagenesis of plantchloroplast monogalactosyldiacylglycerol synthase reveal critical residuesfor activity, J. Biol. Chem. 280 (41) (2005) 34691–34701.
Olivier Bastien received two B.S. degrees inPopulation Biology (1999) and Cell Biology andPhysiology (2001) from the University of Grenoble,France and a M.S. in Biomathematics (2002) fromthe University of Paris VI. He received a Ph.D.in Biomathematics from the University of Grenoble(2006) in the Plant Physiology Laboratory, at theCommissariat a l’Energie Atomique, Grenoble, and atGene-IT (GenomeQuest), Rueil Malmaison, France.
He coauthored several papers in the area of biomathematics, biology andbioinformatics. His Ph.D. project focuses on the development of new numericalapproaches for the comparison of compositionally biased sequences in thefield of malaria genomics, using parallelized algorithms. He designed massiveprotein sequence comparison experiments, carried out using the CEA TERAsupercomputer, Bruyere-Le-Chatel. He designed the configuration space ofhomologous proteins using concepts from information theory and theoreticalphysics. He developed statistical methods to assist the analysis of bi-dimensional protein electrophoresis (proteomics) and software for capillaryelectrophoresis. Current research interests include genome annotation theories,phylogenomics, automatic homology prediction, statistics for sequenceanalyses, and epigenetic models.
Philppe Ortet received his M.S. (1995) in computerengineering from the University of Paris VI. In 1997,he joined the Commissariat a l’Energie Atomique inthe technical and scientific information departmentand, in 2001, the Department of Plant Ecophysiologyand Microbiology, Cadarache, France, where hewas in charge of information systems, algorithmicand developments for local teams of biologists. Hecoauthored several papers in the area of integrated
biology and bioinformatics. He developed tools to assist the analysis ofmicroarrays (DNA chips). He recently designed the bioinformatic workflow andtools (GenoBrowser) for the complete annotation of bacterial genomes. Currentresearch interests include complete genome annotation, phylogenomics,automatic homology prediction and sequence analyses.
Sylvaine Roy received a M.S. (1984) in Biochemical En-gineering from the Institut National des Sciences Ap-pliquees (Engineering University), Lyon, France and aM.S. (1986) in Computer Science from Ecole NationaleSuperieure d’Informatique et de Mathematiques Ap-pliquees, Grenoble, France. She worked (1987–1991) inthe research and development software laboratory of theHewlett Packard Company, Grenoble, France. In 1991,she joined the Commissariat a l’Energie Atomique in the
Life Science Division, Grenoble, France, where she was in charge of net-work security, information systems, algorithmic and software developments forlocal teams of biologists. She coauthored several papers in the area of biol-ogy and bioinformatics. She recently managed software development projectsfor the analysis and integration of automatic molecular screening data. Hercurrent research interests include chemogenomics, artificial intelligence,machine-learning, data semantic and data integration.
Eric Marechal is currently in charge of the Compar-ative Chemogenomics team in the Plant PhysiologyLaboratory at the Commissariat a l’Energie Atomique,Grenoble, France. He received an Agregation teachingdegree in Life Science (1990) from the Ministry of Na-tional Education, a M.S. (1991) in Cell Biology fromthe Ecole Normale Superieure de Lyon, France, and aPh.D. (1994) in Molecular and Cell Biology from theUniversity of Grenoble, France. From 1994 to 1997
he joined the Nam-Hai Chua Laboratory at the Rockefeller University, NewYork, USA, as a Human Frontier post-doctoral fellow. After a short period atRhone–Poulenc Industrialisation, Lyon, France, he joined the CNRS in 1998.Since 1998, he has focused on the plant features discovered in apicomplexanparasites (a group of species including the malaria parasite Plasmodium fal-ciparum), taking this opportunity to start target and drug discovery projects,in collaboration with the Cerep and Gene-IT Companies, Rueil Malmaison,France. He coauthored several papers in biochemistry, cell biology, plant physi-ology, toxoplasmosis and malaria research, biomathematics, bioinformatics andchemogenomics. His scientific interests include chemogenomics, from phar-macological high-throughput screening to drug candidate development, com-parative genomics of plant and apicomplexans and phylogenomics, from high-throughput protein phylogeny to comparative physiology.