18
Future Generation Computer Systems 23 (2007) 410–427 www.elsevier.com/locate/fgcs The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons Olivier Bastien a,b , Philippe Ortet c , Sylvaine Roy d , Eric Mar´ echal a,* a UMR 5168 CNRS-CEA-INRA-Universit´ e Joseph Fourier, Laboratoire de Physiologie Cellulaire V´ eg´ etale, D´ epartement R´ eponse et Dynamique Cellulaires, CEA Grenoble, 17 rue des Martyrs, F-38054 Grenoble cedex 09, France b Gene-IT, 147 avenue Paul Doumer, F-92500 Rueil-Malmaison, France c epartement d’Ecophysiologie V´ eg´ etale et de Microbiologie, CEA Cadarache, F-13108 Saint Paul-lez-Durance, France d Laboratoire Biologie, Informatique, Math´ ematiques, D´ epartement R´ eponse et Dynamique Cellulaires, CEA Grenoble, 17 rue des Martyrs, F-38054 Grenoble cedex 09, France Received 9 December 2005; received in revised form 30 March 2006; accepted 2 July 2006 Available online 30 August 2006 Abstract Most of the millions of virtual protein sequences deduced from genomic DNA, and the millions to come, will not be experimentally confirmed, neither their function directly analyzed. The exploration of the majority of the protein space relies on our ability to extrapolate the portion of knowledge on characterized sequences to unknown sequences. In this paper we analyzed the large scale comparisons of hundreds of thousands of protein sequences that have been previously carried out using the power of supercomputers or grid frameworks. Following these comparisons, pragmatic rules were used to reduce protein diversity, but none was based on a rigorous and robust framework. We examined how projection of sequences in the configuration space of homologous proteins (CSHP) could help in providing a theoretically robust and long-term practical solution to help organize the protein space. The CSHP can be constructed from the output of any all-by-all pair-wise comparison in which Z-values were computed after Monte Carlo simulations. Reduction of protein diversity can be carried out according to an evolutionary model raising consistent phylogenetic clusters. Projection in the CSHP can be easily updated after sequence database updates, and the accuracy of the phylogenetic topology can be upgraded by improving sub-models. Clusters of homologous proteins can be represented as phylogenetic trees (TULIP trees). In this paper, we showed that the CSHP projection can be used to process the outputs of previous massive comparison projects based on Z-value statistics, given minor corrections for uncollected low values and we propose guidelines for future generations of massive protein sequence comparison projects. c 2006 Elsevier B.V. All rights reserved. Keywords: Protein space; Protein sequence comparison; TULIP; Configuration space of homologous protein; CluSTr; Z-value 1. Introduction In 2005, more than 3 million protein sequences obtained from numerical translation of DNA coding sequences (the reading frames of genes) were electronically stored in public * Corresponding address: UMR 5019 CNRS-CEA-INRA-Universit´ e Joseph Fourier, Laboratoire de Physiologie Cellulaire V´ eg´ etale, D´ epartement R´ eponse et Dynamique Cellulaire, CEA Grenoble, 17 rue des Martyrs, F-38054 Grenoble cedex 09, France. Tel.: +33 04 38 78 49 85; fax: +33 04 38 78 50 91. E-mail address: [email protected] (E. Mar´ echal). databases. The production rate of virtual proteins is exponen- tial, fueled by international funding, laboratory networking and competition. Additional productivity gain is expected soon, fol- lowing a recent technological breakthrough [1], i.e. a DNA se- quencing method allowing a 100 fold increase in throughput, reading for instance 25 million bases of genetic code – the en- tire genome of some fungi – within hours. Publicity following the release of complete genomes raised higher expectations than what could be reasonably achieved in the short term. This is particularly true for the human species, with a hope of accel- erating the identification of genes involved in genetic diseases 0167-739X/$ - see front matter c 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2006.07.016

The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

Embed Size (px)

Citation preview

Page 1: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

Future Generation Computer Systems 23 (2007) 410–427www.elsevier.com/locate/fgcs

The configuration space of homologous proteins: A theoretical and practicalframework to reduce the diversity of the protein sequence space after

massive all-by-all sequence comparisons

Olivier Bastiena,b, Philippe Ortetc, Sylvaine Royd, Eric Marechala,∗

a UMR 5168 CNRS-CEA-INRA-Universite Joseph Fourier, Laboratoire de Physiologie Cellulaire Vegetale, Departement Reponse et Dynamique Cellulaires,CEA Grenoble, 17 rue des Martyrs, F-38054 Grenoble cedex 09, Franceb Gene-IT, 147 avenue Paul Doumer, F-92500 Rueil-Malmaison, France

c Departement d’Ecophysiologie Vegetale et de Microbiologie, CEA Cadarache, F-13108 Saint Paul-lez-Durance, Franced Laboratoire Biologie, Informatique, Mathematiques, Departement Reponse et Dynamique Cellulaires, CEA Grenoble, 17 rue des Martyrs, F-38054

Grenoble cedex 09, France

Received 9 December 2005; received in revised form 30 March 2006; accepted 2 July 2006Available online 30 August 2006

Abstract

Most of the millions of virtual protein sequences deduced from genomic DNA, and the millions to come, will not be experimentally confirmed,neither their function directly analyzed. The exploration of the majority of the protein space relies on our ability to extrapolate the portion ofknowledge on characterized sequences to unknown sequences. In this paper we analyzed the large scale comparisons of hundreds of thousandsof protein sequences that have been previously carried out using the power of supercomputers or grid frameworks. Following these comparisons,pragmatic rules were used to reduce protein diversity, but none was based on a rigorous and robust framework. We examined how projectionof sequences in the configuration space of homologous proteins (CSHP) could help in providing a theoretically robust and long-term practicalsolution to help organize the protein space. The CSHP can be constructed from the output of any all-by-all pair-wise comparison in whichZ-values were computed after Monte Carlo simulations. Reduction of protein diversity can be carried out according to an evolutionary modelraising consistent phylogenetic clusters. Projection in the CSHP can be easily updated after sequence database updates, and the accuracy of thephylogenetic topology can be upgraded by improving sub-models. Clusters of homologous proteins can be represented as phylogenetic trees(TULIP trees). In this paper, we showed that the CSHP projection can be used to process the outputs of previous massive comparison projectsbased on Z-value statistics, given minor corrections for uncollected low values and we propose guidelines for future generations of massive proteinsequence comparison projects.c© 2006 Elsevier B.V. All rights reserved.

Keywords: Protein space; Protein sequence comparison; TULIP; Configuration space of homologous protein; CluSTr; Z-value

1. Introduction

In 2005, more than 3 million protein sequences obtainedfrom numerical translation of DNA coding sequences (thereading frames of genes) were electronically stored in public

∗ Corresponding address: UMR 5019 CNRS-CEA-INRA-Universite JosephFourier, Laboratoire de Physiologie Cellulaire Vegetale, Departement Reponseet Dynamique Cellulaire, CEA Grenoble, 17 rue des Martyrs, F-38054Grenoble cedex 09, France. Tel.: +33 04 38 78 49 85; fax: +33 04 38 78 5091.

E-mail address: [email protected] (E. Marechal).

0167-739X/$ - see front matter c© 2006 Elsevier B.V. All rights reserved.doi:10.1016/j.future.2006.07.016

databases. The production rate of virtual proteins is exponen-tial, fueled by international funding, laboratory networking andcompetition. Additional productivity gain is expected soon, fol-lowing a recent technological breakthrough [1], i.e. a DNA se-quencing method allowing a 100 fold increase in throughput,reading for instance 25 million bases of genetic code – the en-tire genome of some fungi – within hours. Publicity followingthe release of complete genomes raised higher expectationsthan what could be reasonably achieved in the short term. Thisis particularly true for the human species, with a hope of accel-erating the identification of genes involved in genetic diseases

Page 2: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 411

or cancer; for pathogens, with a hope of identifying new fight-ing targets; and for crops, with a hope of improving productivityand quality. Meanwhile, chemical technologies made progressin the throughput of small molecule syntheses. Millions of com-pounds have been physically stored in chemolibraries; theirmolecular structures were electronically stored. Having in hand3D structures of protein active sites and of chemolibraries,methods are available to predict interactions in virtual dock-ing experiments (e.g. [2]). PubChem, a repository for moleculesacting on biological targets, was recently launched [3] and theUniProt protein knowledge base was recently upgraded to re-port toxic doses of small molecules on proteins [4]. Access to anocean of small molecular structures and to a deluge of biolog-ical sequences raised therefore an enthusiastic challenge: “Thegoal for the coming decades will be to explore the overlap be-tween chemistry space and protein space” [5]. No doubt gigan-tic computing power will be demanded to reach this chemoge-nomic horizon. Is this post-genomic prediction exuberant? Be-sides upstream quality of data and downstream methods bywhich interactions might be screened, can we at least definewhat we mean by protein and small molecule spaces? This pa-per focuses on our recent advances toward the definition of astable reference for a protein space and navigation map at thesequence level.

In recent years, large scale comparisons of hundreds ofthousands of protein sequences have been released usingthe calculation power of supercomputers or grid frameworks(Tables 1 and 2). Putting some order into thousands of proteinsfollowing these massive comparisons, using representationsbased on “proximity” criteria, have provided pragmatic rulesto reduce the initial diversity. For a long term use, updatingpipelines should prevent initial results from being lost andshould be easily carried out. A wealth of bioinformaticmethods exist to compare and sort proteins at the level offull-length sequences or of sequence sub-domains and motifs.Combining methods is wise to circumvent the limitations ofeach technique; however no combination is considered asa standard. Thus, although some attempts to organize theprotein space were undergone by pragmatic combinations ofbioinformatic methods, a representation based on such complexworkflows is risky on a long term perspective. Given thenumber of virtual protein sequences (soon exceeding two-digitmillions) and the long-term investment, organizing the proteinspace should start by a simple process, should be biologicallysound, theoretically rather than pragmatically defined so as tobe as explicit as possible, and should be statistically accurate,with the widest validity domain. Because of the CPU cost, suchpre-treatment of the virtual protein sequences should also beeasily incremented with newly released protein sequences.

Comparing protein primary sequences can be carried outfollowing two main methods: the first is based on multiplealignments, usually achieved on sets of sequences presumedto be homologous, the second is based on native pairwisecomparisons. Multiple alignment-based comparisons, raisingphylogenetic protein clusters, are expensive, have to becompletely repeated after each data update and are susceptibleto raise modified phylogeny reconstructions when proteins are

added or removed from the multiple alignment sets. Thus,although protein clusters based on phylogeny is appealing tobiologists, the reduction of protein space diversity based onmultiple alignments demands an exorbitant computing powerand does not allow the conservation of results through updating.For these reasons, multiple alignment-based comparisons donot appear as a practical way to assist the organization of theprotein space.

Few massive pairwise sequence comparisons have beenperformed, some of which being described in Tables 1 and2. Output of an all-by-all comparison of n protein sequencesis an n × n table (Fig. 1). According to the output tableprocessing, it can be either totally recomputed at each databaseupdate, i.e. {new + old}-by-{new + old}, or stored and updatedby computing complementary {new}-by-{new} and {new}-by-{old} tables. Information is extracted from the output table tohelp reduce complexity and diversity at the sequence level.Sets of sequences sharing features are named “clusters” [6]that can be used to extrapolate some functional informationfrom characterized to uncharacterized sequences. In mostcases, results were not exploited by the potential end-users,i.e. bench biologists, at the level one would expect followingsuch expensive experiments. Numerous projects, unlisted here,simply closed the internet portal providing access to theirresults two or three years after release. One reason for thepoor public success of massive comparisons lies in the difficultaccess to protein clusters for the bench biologists and the lackof biologically sound or explicit representation.

In this paper we analyzed the large scale comparisons ofhundreds of thousands of protein sequences that have beenpreviously carried out using the power of supercomputersor grid frameworks and the two general statistical modelsused to assess the alignment score significance. We presentthe configuration space of homologous protein sequences, orCSHP [7], as a projection of the virtual protein sequencespace that is theoretically supported by information theory [8,9]. It is constructed from pair-wise sequence alignment scoresand Z-values obtained after Monte Carlo simulations. In theCSHP, reduction of the protein diversity can be carried outaccording to evolutionary models raising phylogenetic clusters.The accuracy of the Darwinian topology underlying the CSHPprojection depends on the method used to perform sequencecomparisons, the amino acid scoring matrix used to weightthe alignments, the sequence randomization technique and nullmodel used for Monte Carlo simulations and the molecularevolution model used to reconstruct phylogenetic trees. Thespatial representation of proteins in the CSHP can thereforebe easily updated, and the CSHP accuracy and phylogenetictopology be upgraded by improving any of the implementedsub-models. Diversity and complexity of the protein worldis reduced by detecting clusters of phylogenetically relatedproteins, and clusters can be viewed as real phylogenetic trees,named TULIP trees, making sense to a bench biologist. TheCSHP projection aims at providing an underlying order tothe protein space, and to assist the definition of other sub-protein spaces at the structural (sequence similarity only) orfunctional (with links to annotation) levels, as well as subspaces

Page 3: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

412 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427

Tabl

e1

Mas

sive

all-

by-a

llpr

otei

nse

quen

cepa

irw

ise

com

pari

sons

base

don

alig

nmen

tE-v

alue

stat

istic

s

Proj

ect

nam

ePr

otei

nse

quen

ceD

BPa

irw

ise

com

pari

son

Div

ersi

tyre

duct

ion

and

clus

teri

ngU

pdat

ing

Ref

eren

cean

din

tern

etac

cess

Num

ber

ofno

n-re

dund

ant

sequ

ence

s(fi

rstr

elea

seda

te)

Sour

ceC

ompu

ting

reso

urce

sA

min

oac

idsc

orin

gm

atri

x

Alig

nmen

tal

gori

thm

Alig

nmen

tsta

tistic

sO

utpu

ttab

leO

rgan

izat

ion

prin

cipl

es(p

ragm

atic

rule

vsth

eore

tical

spat

ial

proj

ectio

n)

Clu

ster

ing

met

hod

and

prod

uced

hier

arch

y

Clu

ster

repr

esen

tatio

nC

lust

erst

atis

tics

Pipe

line

Freq

uenc

y

Mod

elSt

atis

tical

para

met

erC

utof

f

CO

G17

,967

(Oct

ober

1997

)>

170,

000

(Nov

embe

r20

05)

Initi

ally

:en

trie

sfr

om7

com

plet

ege

nom

es.

In20

05:

entr

ies

from

73co

mpl

ete

geno

mes

.

nana

BL

AST

PK

arlin

–A

ltsch

ulm

odel

for

E-v

alue

com

puta

tion

E-v

alue

naB

esth

it(B

eT)

ofea

chen

try,

com

pare

dto

alle

ntri

esbe

long

ing

toa

dist

inct

geno

me

Prag

mat

icor

gani

zatio

nof

data

/no

expl

icit

spat

ial

proj

ectio

n.Pr

agm

atic

rule

:co

nstr

uctio

nof

BeT

grap

hsin

clud

ing

entr

ies

inat

leas

tthr

eege

nom

es

Det

ectio

nof

cons

iste

ntpa

ttern

sin

BeT

grap

hsin

clud

ing

entr

ies

inat

leas

t3ge

nom

es+

insp

ectio

nfo

rpr

otei

ndo

mai

ns.N

ohi

erar

chy

prod

uced

.

Lis

tsof

entr

ies

belo

ngin

gto

clus

ters

(nam

edC

OG

)cl

assi

fied

byoc

curr

ence

/ab

senc

ein

each

geno

me

(spe

cies

profi

ling)

Initi

ally

none

.In

2001

,ate

stw

asin

trod

uced

base

don

anes

timat

eof

the

prob

abili

tyth

atan

entr

yis

assi

gned

toa

CO

Gby

chan

ce

Upd

ate

prot

ein

data

base

geno

me

byge

nom

e;co

mpu

teB

eTw

ithne

wen

trie

sag

ains

ten

trie

sfr

omge

nom

espr

evio

usly

com

pare

d;co

mpl

ete

and

incr

emen

tB

eTgr

aphs

;re

peat

clus

teri

ngfr

omne

wB

eTgr

aphs

;ap

ply

stat

istic

alte

ston

grap

hs

Las

tdat

abas

eup

date

attim

eof

wri

ting:

May

2003

.Upd

ate

freq

uenc

yof

the

CO

Gda

taba

sede

pend

son

the

rele

ase

freq

uenc

yof

relia

ble

com

plet

ege

nom

e.M

ore

freq

uent

upda

tes

ofth

een

try

refe

renc

esar

eun

derg

one.

[ 42–

45]

Tri

be31

1,25

7(M

arch

2003

)

CO

GE

NT

(83

geno

mes

)+

Swis

sPro

t

400

node

Com

paq

Alp

haD

S10

clus

ter

(8h)

naB

LA

STP

Kar

lin–

Alts

chul

mod

elfo

rE

-val

ueco

mpu

tatio

n

E-v

alue

E-v

alue

≤1.

10−

10A

lignm

ent

scor

ean

dE

-val

ues

proc

esse

dto

corr

ect

asym

met

ric

BL

AST

hits

and

scor

es

Prag

mat

icor

gani

zatio

nof

data

/no

expl

icit

spat

ial

proj

ectio

n.C

onve

rsio

nof

E-v

alue

tabl

ein

toM

arko

vM

atri

x

Mar

kov-

rand

om-

field

clus

teri

ngw

ithT

ribe

MC

Lal

gori

thm

.N

ohi

erar

chy

prod

uced

.

Clu

ster

sof

mul

tiple

gran

ular

ities

byal

teri

ngin

flatio

nva

lue

used

for

clus

teri

ng

naU

pdat

epr

otei

nda

taba

se;

com

pute

mis

sing

com

pari

sons

(old

byol

dan

dol

dby

new

);ap

ply

E-v

alue

cuto

ff;r

epea

tcl

uste

ring

from

new

Mar

kov

Mat

rix

No

curr

ent

publ

icac

cess

topr

otei

ncl

uste

rs

[ 46,

47]

No

curr

ent

publ

icac

cess

topr

otei

ncl

uste

rs

Page 4: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 413

Tabl

e1

(con

tinue

d)

Proj

ect

nam

ePr

otei

nse

quen

ceD

BPa

irw

ise

com

pari

son

Div

ersi

tyre

duct

ion

and

clus

teri

ngU

pdat

ing

Ref

eren

cean

din

tern

etac

cess

Num

ber

ofno

n-re

dund

ant

sequ

ence

s(fi

rstr

elea

seda

te)

Sour

ceC

ompu

ting

reso

urce

sA

min

oac

idsc

orin

gm

atri

x

Alig

nmen

tal

gori

thm

Alig

nmen

tsta

tistic

sO

utpu

ttab

leO

rgan

izat

ion

prin

cipl

es(p

ragm

atic

rule

vsth

eore

tical

spat

ial

proj

ectio

n)

Clu

ster

ing

met

hod

and

prod

uced

hier

arch

y

Clu

ster

repr

esen

tatio

nC

lust

erst

atis

tics

Pipe

line

Freq

uenc

y

Mod

elSt

atis

tical

para

met

erC

utof

f

Prot

oNet

94,1

53(J

uly

2002

)>

1,00

0,00

0(S

epte

mbe

r20

04)

Swis

sPro

t,la

tter

expa

nded

to TrE

MB

Lan

dev

entu

ally

to Uni

Prot

naB

LO

SUM

62B

LA

STP

Kar

lin–

Alts

chul

mod

elfo

rE

-val

ueco

mpu

tatio

n

E-v

alue

E-v

alue

≤10

E-v

alue

sPr

agm

atic

orga

niza

tion

ofda

ta/n

oex

plic

itsp

atia

lpr

ojec

tion.

Pseu

do-m

etri

csba

sed

onE

-val

ues.

Hie

rarc

hica

lde

tect

ion

ofcl

uste

rsfo

llow

ing

ano

n-su

perv

ised

proc

edur

e.D

egre

esof

gran

ular

ityar

ede

fined

bym

ergi

ngcl

uste

rsba

sed

onE

-val

ueav

erag

ede

fined

byth

ear

ithm

etic

,sq

uare

,ge

omet

ric

orha

rmon

icm

eans

com

pute

dw

ithE

-val

ues

take

nas

aki

ndof

met

rics

.

Clu

ster

sof

mul

tiple

gran

ular

ities

follo

win

gth

em

ergi

ngru

le.

Num

ber

and

size

ofcl

uste

rsva

ryw

ithth

em

etho

dof

mea

nco

mpu

tatio

n.D

efau

ltst

rate

gyis

unw

eigh

ted

pair

grou

pw

ithar

ithm

etic

mea

n(U

PGM

A),

gene

ratin

gU

PGM

tree

s(w

hich

are

not

phyl

ogen

etic

tree

s)

Non

-sup

ervi

sed

defin

ition

ofcl

uste

rhi

erar

chie

sba

sed

onau

tom

atic

cond

ensa

tion

rule

s

Upd

ate

prot

ein

data

base

;the

mas

sive

com

pari

son

upda

te,i

.e.

all

com

pari

sons

(old

+ne

wby

old

+

new

)or

only

mar

gina

l(ol

dby

new

and

new

byne

w),

isna

;the

upda

tepr

oced

ure

for

E-v

alue

sis

na;r

epea

tU

PGM

Acl

uste

ring

.

Las

tdat

abas

eup

date

:Se

ptem

ber

2004

.Upd

ate

freq

uenc

yis

curr

ently

once

ever

y6

mon

ths.

[ 48–

50]

(con

tinue

don

next

page

)

Page 5: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

414 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427

Tabl

e1

(con

tinue

d)

Proj

ect

nam

ePr

otei

nse

quen

ceD

BPa

irw

ise

com

pari

son

Div

ersi

tyre

duct

ion

and

clus

teri

ngU

pdat

ing

Ref

eren

cean

din

tern

etac

cess

Num

ber

ofno

n-re

dund

ant

sequ

ence

s(fi

rstr

elea

seda

te)

Sour

ceC

ompu

ting

reso

urce

sA

min

oac

idsc

orin

gm

atri

x

Alig

nmen

tal

gori

thm

Alig

nmen

tsta

tistic

sO

utpu

ttab

leO

rgan

izat

ion

prin

cipl

es(p

ragm

atic

rule

vsth

eore

tical

spat

ial

proj

ectio

n)

Clu

ster

ing

met

hod

and

prod

uced

hier

arch

y

Clu

ster

repr

esen

tatio

nC

lust

erst

atis

tics

Pipe

line

Freq

uenc

y

Mod

elSt

atis

tical

para

met

erC

utof

f

Prot

oMap

365,

174

(May

2000

)

Swis

sPro

t+

TrE

MB

LM

OSI

XPa

ralle

lsy

stem

and

clus

ter

ofPC

s

BL

OSU

M50 +

BL

OSU

M62

BL

AST

P+

FAST

A+

SW

Exp

ecta

tion

valu

efo

ran al

ignm

ent

scor

eco

mpu

ted

with

the

real

dist

ribu

tion

ofal

lsc

ores

obta

ined

with

one

ofth

eal

igne

dse

quen

ces

agai

nsta

llth

epr

otei

nda

taba

se;

for

BL

AST

P,E

-val

ueis

obta

ined

from

the

Kar

lin–A

ltsch

ulm

odel

E-v

alue

E-v

alue

≤1

Con

sens

usE

-val

ues

obta

ined

afte

rnu

mer

ical

norm

aliz

atio

nof

E-v

alue

sob

tain

edw

ithal

lal

ignm

ent

algo

rith

ms

Prag

mat

icor

gani

zatio

nof

data

/no

expl

icit

spat

ial

proj

ectio

n.Pr

agm

atic

rule

:de

tect

ion

ofgr

aphs

,in

whi

chth

ew

eigh

tof

aned

geco

nnec

ting

two

prot

eins

isth

eE

-val

ue(M

arko

vian

repr

esen

tatio

n).

Pseu

do-m

etri

csba

sed

onE

-val

ues:

cons

ensu

sE

-val

ues

are

take

nas

dist

ance

sfo

rtr

eere

pres

enta

tions

Hie

rarc

hica

lde

tect

ion

ofgr

aphs

star

ting

with

agr

anul

arity

dete

rmin

edby

mos

tst

ring

ent

E-v

alue

s(<

10−

100),

and

grow

ing

byde

tect

ing

rela

tions

with

E-v

alue

slo

wer

than

10−

95,10

−90

,et

c.10

0(=

1)

Gra

phs

with

E-v

alue

sas

edge

wei

ghts

and

E-v

alue

base

dde

ndro

gram

s(w

hich

are

not

phyl

ogen

etic

tree

s)

Stat

istic

alte

stap

plie

dto

dete

ctan

del

imin

ate

“pro

blem

atic

”an

d“p

ossi

bly

fals

e”co

nnec

tions

betw

een

unre

late

dpr

otei

ns.

Qua

lity

ofco

nnec

tions

isde

fined

asm

inus

the

log

ofth

ege

omet

ric

mea

nof

pair

wis

eE

-val

ues

betw

een

conn

ecte

dse

quen

ces

Upd

ate

prot

ein

data

base

;sa

veor

igin

alsc

ore

dist

ribu

tion

and

scor

esw

ithE

-val

ues

<1;

com

pute

mis

sing

com

pari

sons

(old

byol

dan

dol

dby

new

);in

crem

ent

scor

edi

stri

butio

n;co

mpu

teal

lE

-val

ues

with

the

new

scor

edi

stri

butio

n;ap

ply

stat

istic

alte

ston

E-v

alue

;de

fine

grap

hs;a

pply

stat

istic

alte

ston

grap

hs

Las

tdat

abas

eup

date

attim

eof

wri

ting:

May

2000

.

[ 51–

53]

Page 6: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 415

Tabl

e1

(con

tinue

d)

Proj

ect

nam

ePr

otei

nse

quen

ceD

BPa

irw

ise

com

pari

son

Div

ersi

tyre

duct

ion

and

clus

teri

ngU

pdat

ing

Ref

eren

cean

din

tern

etac

cess

Num

ber

ofno

n-re

dund

ant

sequ

ence

s(fi

rstr

elea

seda

te)

Sour

ceC

ompu

ting

reso

urce

sA

min

oac

idsc

orin

gm

atri

x

Alig

nmen

tal

gori

thm

Alig

nmen

tsta

tistic

sO

utpu

ttab

leO

rgan

izat

ion

prin

cipl

es(p

ragm

atic

rule

vsth

eore

tical

spat

ial

proj

ectio

n)

Clu

ster

ing

met

hod

and

prod

uced

hier

arch

y

Clu

ster

repr

esen

tatio

nC

lust

erst

atis

tics

Pipe

line

Freq

uenc

y

Mod

elSt

atis

tical

para

met

erC

utof

f

SIM

AP

>3,

500,

000

(Fro

m20

03to

Oct

ober

2005

)

Uni

Prot

+m

ips

nonr

edH

+PF

AM

+PD

B+

entr

ies

of com

plet

ege

nom

es

Ver

satil

epi

pelin

eru

nnin

gon

Sun

Gri

deng

ine

clus

ters

and

BO

INC

-bas

edgr

idsy

stem

s

BL

OSU

M50

FAST

A+

SWFA

STA

isus

edin

alo

wco

stpr

erun

tode

tect

best

hits

and

SWis

used

ina

seco

ndru

non

the

sele

cted

pair

sof

sequ

ence

sto

com

pute

mos

texa

ctal

ignm

ent

scor

es

E-v

alue

SW scor

e≥

80

Alig

nmen

tra

wpa

ram

eter

sin

clud

ing

SWsc

ore,

and

E-v

alue

s

Prag

mat

icor

gani

zatio

nof

data

/no

expl

icit

spat

ial

proj

ectio

n.C

onve

rsio

nof

E-v

alue

tabl

ein

toM

arko

vM

atri

x.D

etec

tion

ofbi

dire

ctio

nal

best

hits

.

Mar

kov-

rand

om-

field

clus

teri

ngw

ithA

mst

erda

mM

CL

algo

rith

m.

No

hier

arch

ypr

oduc

ed.

Clu

ster

sof

mul

tiple

gran

ular

ities

byal

teri

ngin

flatio

nva

lue

used

for

clus

teri

ng.

Clu

ster

sar

epr

ovid

edas

lists

ofen

trie

sth

atca

nbe

sort

edba

sed

onal

ignm

ent

para

met

ers

(e.g

.len

gth,

iden

tity,

scor

e,Z

-sco

re).

Lis

tis

used

toge

nera

tem

ultip

leal

ignm

ents

and

hidd

enM

arko

vm

odel

s.

naU

pdat

epr

otei

nda

taba

se;

com

pute

mis

sing

com

pari

sons

(old

byol

dan

dol

dby

new

);ap

ply

SWsc

ore

cuto

ff;r

epea

tcl

uste

ring

from

new

Mar

kov

Mat

rix

Las

tdat

abas

eup

date

attim

eof

wri

ting:

Oct

ober

2005

.Im

port

ant

upda

tefr

eque

ncy

iscu

rren

tly∼

bim

onth

lyw

ithpe

aks

ever

y6

mon

ths.

[ 54,

55]

(con

tinue

don

next

page

)

Page 7: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

416 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427

Tabl

e1

(con

tinue

d)

Proj

ect

nam

ePr

otei

nse

quen

ceD

BPa

irw

ise

com

pari

son

Div

ersi

tyre

duct

ion

and

clus

teri

ngU

pdat

ing

Ref

eren

cean

din

tern

etac

cess

Num

ber

ofno

n-re

dund

ant

sequ

ence

s(fi

rstr

elea

seda

te)

Sour

ceC

ompu

ting

reso

urce

sA

min

oac

idsc

orin

gm

atri

x

Alig

nmen

tal

gori

thm

Alig

nmen

tsta

tistic

sO

utpu

ttab

leO

rgan

izat

ion

prin

cipl

es(p

ragm

atic

rule

vsth

eore

tical

spat

ial

proj

ectio

n)

Clu

ster

ing

met

hod

and

prod

uced

hier

arch

y

Clu

ster

repr

esen

tatio

nC

lust

erst

atis

tics

Pipe

line

Freq

uenc

y

Mod

elSt

atis

tical

para

met

erC

utof

f

SYST

ER

SV

41,

168,

498

(Nov

embe

r20

03)

Swis

sPro

t+

TrE

MB

L+

entr

ies

of11

com

plet

eeu

kary

otic

geno

mes

Para

cel

Gen

eMat

cher

Wor

ksta

tion

naSW

Exp

ecta

tion

valu

efo

ran al

ignm

ent

scor

eco

mpu

ted

with

the

real

dist

ribu

tion

ofal

lsc

ores

obta

ined

with

one

ofth

eal

igne

dse

quen

ces

agai

nsta

llth

epr

otei

nda

taba

se

E-v

alue

E-v

alue

≤0.

05E

-val

ues

Prag

mat

icor

gani

zatio

nof

data

/no

expl

icit

spat

ial

proj

ectio

n.Pr

agm

atic

rule

:de

tect

ion

ofgr

aphs

,in

whi

chth

ew

eigh

tof

aned

geco

nnec

ting

two

prot

eins

isth

eE

-val

ue(M

arko

vian

repr

esen

tatio

n).

Pseu

do-m

etri

csba

sed

onE

-val

ues:

E-v

alue

sar

eta

ken

asdi

stan

ces

for

tree

repr

esen

tatio

ns.

SYST

Em

atic

RE

-Sea

rchi

ngcl

uste

ring

:st

epw

ise

cons

truc

tion

ofsi

ngle

linka

gecl

uste

ring

tree

s;hi

erar

chy

base

don

inte

rnal

stru

ctur

eof

tree

s;co

nstr

uctio

nof su

perf

amily

E-v

alue

grap

hs(E

-val

ues

asdi

stan

ces)

;su

bclu

ster

dete

ctio

nat

wea

kco

nnec

tions

.

Lis

tsof

entr

ies

belo

ngin

gto

clus

ters

;po

st-c

ompu

tatio

nof

DIA

LIG

Nm

ultip

leal

ignm

ents

and

ofth

eco

rres

pond

ing

unw

eigh

ted

pair

grou

pw

ithar

ithm

etic

mea

n(U

PGM

A)

tree

s.In

prev

ious

rele

ase,

mul

tiple

alig

nmen

tsw

ere

perf

orm

edus

ing

CL

UST

ALW

Clu

ster

cons

truc

tion

ises

timat

edse

lf-v

alid

atin

g

Upd

ate

prot

ein

data

base

;the

mas

sive

com

pari

son

upda

te,i

.e.

all

com

pari

sons

(old

+ne

wby

old

+

new

)or

only

mar

gina

l(ol

dby

new

and

new

byne

w),

isna

;the

upda

tepr

oced

ure

for

E-v

alue

sis

na;r

epea

tSY

STE

mat

icR

E-S

earc

hing

clus

teri

ng.

Las

tdat

abas

eup

date

attim

eof

wri

ting:

Nov

embe

r20

03.F

our

rele

ases

sinc

e19

99,

with

maj

orm

etho

dm

odifi

catio

ns(f

rom

BL

AST

Pto

SW).

Publ

icac

cess

esto

rele

ases

3an

d4.

[ 56–

58]

na:n

otav

aila

ble

Page 8: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 417

Tabl

e2

Mas

sive

all-

by-a

llpr

otei

nse

quen

cepa

irw

ise

com

pari

sons

base

don

alig

nmen

tZ-v

alue

stat

istic

s

Proj

ect

nam

ePr

otei

nse

quen

ceD

BPa

irw

ise

com

pari

son

Div

ersi

tyre

duct

ion

and

clus

teri

ngU

pdat

ing

Ref

eren

cean

din

tern

etac

cess

Num

ber

ofno

n-re

dund

ant

sequ

ence

s(fi

rstr

elea

seda

te)

Sour

ceC

ompu

ting

reso

urce

sA

min

oac

idsc

orin

gm

atri

x

Alig

nmen

tal

gori

thm

Alig

nmen

tsta

tistic

sO

utpu

tta

ble

Org

aniz

atio

npr

inci

ples

(pra

gmat

icru

levs

theo

retic

alsp

atia

lpr

ojec

tion)

Clu

ster

ing

met

hod

and

prod

uced

hier

arch

y

Clu

ster

repr

esen

tatio

nC

lust

erst

atis

tics

Pipe

line

Freq

uenc

y

Mod

elSt

atis

tical

para

met

erC

utof

f

Dec

rypt

hon

559,

275

(Jan

uary

2002

)

Swis

sPro

t+

TrE

MB

L+ en

trie

sof

76co

mpl

ete

geno

mes

Gri

dco

mpu

ting

mod

elru

nnin

gw

ithun

used

com

putin

gpo

wer

of75

,000

PCs

(200

hea

ch),

netw

ork

arch

itect

ure

corr

espo

nds

toa

virt

ual

calc

ulat

ion

pow

er>

40Te

raflo

ps

BL

OSU

M62

SWZ

-sco

reco

mpu

tatio

nby

50sh

uffli

ngof

each

alig

ned

sequ

ence

(100

shuf

fling

per

com

pari

son)

Z-s

core

Z-s

core

≥5

Alig

nmen

tra

wpa

ram

eter

san

dZ

-sco

re

naB

asic

defin

ition

base

don

Z-s

core

thre

shol

ds.

No

hier

arch

ypr

oduc

ed.

Clu

ster

sar

epr

ovid

edas

lists

ofen

trie

sth

atca

nbe

sort

edba

sed

onal

ignm

ent

para

met

ers

(e.g

.len

gth,

iden

tity,

scor

e,Z

-sco

re)

nana

Las

tdat

abas

eup

date

attim

eof

wri

ting:

Janu

ary

2002

[ 59]

Tera

Prot

240,

000

(Jul

y20

02)

entr

ies

of67

com

plet

ege

nom

es

645

node

Com

paq

ES4

5cl

uste

r(1

6to

512

proc

esso

rsus

edde

pend

ing

onth

ejo

bs;

76.3

hcu

mul

ated

time,

equi

vale

ntTe

ra)

PAM

10SW

Z-s

core

com

puta

tion

by10

0sh

uffli

ngof

each

alig

ned

sequ

ence

(200

shuf

fling

per

com

pari

son)

and

cons

erva

tion

ofth

em

inim

umZ

-sco

reob

tain

ed

Z-s

core

SW scor

e≥

220

Alig

nmen

tra

wpa

ram

eter

san

dZ

-sco

re

naB

asic

defin

ition

base

don

Z-s

core

thre

shol

ds.

No

hier

arch

ypr

oduc

ed.

Clu

ster

sar

epr

ovid

edas

lists

ofen

trie

sth

atca

nbe

sort

edba

sed

onal

ignm

ent

para

met

ers

(e.g

.len

gth,

iden

tity,

scor

e,Z

-sco

re)

naU

pdat

epr

otei

nda

taba

se;

com

pute

mis

sing

com

pari

sons

(old

byol

dan

dol

dby

new

);ap

ply

scor

ecu

toff

Las

tdat

abas

eup

date

attim

eof

wri

ting:

July

2003

with

entr

ies

of22

addi

tiona

lge

nom

es

[ 60]

(con

tinue

don

next

page

)

Page 9: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

418 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427

Ta

ble

2(c

ontin

ued)

Proj

ect

nam

ePr

otei

nse

quen

ceD

BPa

irw

ise

com

pari

son

Div

ersi

tyre

duct

ion

and

clus

teri

ngU

pdat

ing

Ref

eren

cean

din

tern

etac

cess

Num

ber

ofno

n-re

dund

ant

sequ

ence

s(fi

rstr

elea

seda

te)

Sour

ceC

ompu

ting

reso

urce

sA

min

oac

idsc

orin

gm

atri

x

Alig

nmen

tal

gori

thm

Alig

nmen

tsta

tistic

sO

utpu

tta

ble

Org

aniz

atio

npr

inci

ples

(pra

gmat

icru

levs

theo

retic

alsp

atia

lpr

ojec

tion)

Clu

ster

ing

met

hod

and

prod

uced

hier

arch

y

Clu

ster

repr

esen

tatio

nC

lust

erst

atis

tics

Pipe

line

Freq

uenc

y

Mod

elSt

atis

tical

para

met

erC

utof

f

Phyt

oPro

t14

,723

(Jan

uary

2001

)

All

entr

ies

ofpl

ants

in Swis

sPro

t+

TrE

MB

L

Mul

tipro

cess

orSu

nSp

arc

serv

er45

00

naSW

Z-s

core

com

puta

tion

byn

shuf

fling

(var

ying

from

30to

600

toke

epco

ntro

lto

the

vari

ance

ofZ

)of

each

alig

ned

sequ

ence

and

cons

erva

tion

ofth

em

inim

umZ

-sco

reob

tain

ed.

Z-s

core

Z-s

core

>14

Alig

nmen

tra

wpa

ram

eter

san

dZ

-sco

re

Prag

mat

icor

gani

zatio

nof

data

/no

expl

icit

spat

ial

proj

ectio

n.U

seof

Z-s

core

tabl

efo

rpy

ram

idal

clas

sific

atio

n.Po

st-a

naly

sis

ofse

quen

ces

tode

tect

poss

ibly

shar

edsu

b-do

mai

nsw

ithin

clus

ters

.

Clu

ster

sde

fined

usin

gZ

-sco

reth

resh

olds

.Py

ram

idal

clas

sific

atio

nof

each

clus

ter

base

don

Z-s

core

s.So

me

prot

eins

may

belo

ngsi

mul

tane

ousl

yto

two

clas

ses.

Pyra

mid

alcl

assi

ficat

ion

tree

s(w

hich

are

not

phyl

ogen

etic

tree

s)an

dlis

tof

prot

eins

with

PRO

DO

Mba

sed

sub-

dom

ain

repr

esen

tatio

n.

naU

pdat

epr

otei

nda

taba

se;

com

pute

mis

sing

com

pari

sons

(old

byol

dan

dol

dby

new

);ap

ply

Z-s

core

cuto

ff;r

epea

tcl

uste

rde

tect

ion

and

pyra

mid

alcl

assi

ficat

ion.

Las

tdat

abas

eup

date

attim

eof

wri

ting:

July

2002

[ 61–

63]

Clu

STr

462,

000

(May

2004

)

Initi

ally

mam

mal

ian

and

plan

tpr

otei

nsin Sw

issP

rot

+ TrE

MB

Lan

den

trie

sof

3co

mpl

ete

euka

ryot

epr

otei

ns,

latte

rex

pand

edto U

niPr

ot(i

nclu

ding

195

com

plet

ege

nom

es)

SAR

Asu

perc

ompu

ter,

1024

-CPU

syst

emco

nsis

ting

oftw

o51

2-C

PUSG

IO

rigi

n38

00(o

new

ith32

to25

6C

PUpa

rtiti

on,t

heot

her

with

asi

ngle

512

CPU

part

ition

,nu

mbe

rof

proc

esso

rsw

asad

just

edde

pend

ing

onjo

bs,t

otal

run:

2m

onth

s)

naSW

Z-s

core

com

puta

tion

by10

0sh

uffli

ngof

each

alig

ned

sequ

ence

(200

shuf

fling

per

com

pari

son)

and

cons

erva

tion

ofth

em

inim

umZ

-sco

reob

tain

ed

Z-s

core

naA

lignm

ent

raw

para

met

ers

and

Z-s

core

Prag

mat

icor

gani

zatio

nof

data

/no

expl

icit

spat

ial

proj

ectio

n.U

seof

Z-s

core

tabl

efo

rsi

ngle

-lin

kage

clus

teri

ng.

Lin

ksof

resu

ltsw

ithkn

owle

dge

base

s,th

roug

hth

eIn

tegr

8pl

atfo

rm.

Hie

rarc

hica

lcl

uste

rsde

fined

usin

gZ

-sco

re.

Clu

ster

sar

ede

term

ined

by sing

le-l

inka

gem

etho

d.H

iera

rchy

isbu

iltin

crem

enta

lly.

Bin

ary

fore

stre

pres

entin

gth

ehi

erar

chy

ofcl

uste

rs,i

nw

hich

each

pare

ntcl

uste

rha

son

lytw

ocl

uste

rs.

Prun

ing

ofcl

uste

rsfo

llow

ing

ase

tof

rule

s(e

.g.e

xclu

sion

ofa

clus

ter

whe

nits

mem

bers

form

≥90

%of

itspa

rent

set,

sing

leto

ns).

Obt

aine

dcl

uste

rsar

eco

llect

edin

Clu

STr

Slim

.

Upd

ate

prot

ein

data

base

;co

mpu

tem

issi

ngco

mpa

riso

ns(o

ldby

old

and

old

byne

w);

appl

yZ

-sco

recu

toff

;rep

eat

clus

ter

dete

ctio

nan

dhi

erar

chy

clas

sific

atio

n.

Las

tupd

ate

attim

eof

wri

ting:

May

2004

.M

argi

nal

upda

teof

data

base

with

Uni

Prot

entr

ies

isan

noun

ced

tobe

bi-w

eekl

yat

Clu

STr

inte

rnet

port

al.

[ 64–

67]

na:n

otav

aila

ble

Page 10: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 419

Fig. 1. Massive all-by-all protein sequence comparison and protein spacereduction workflow. From a protein sequence database, a massive comparisonis carried out using an amino acid scoring matrix, an alignment method and acorresponding statistical model. The output of the comparison is a table thatis used for the reduction of the initial protein diversity at the sequence level,following pragmatic rules. Obtained clusters can be viewed as lists, graphs, orclassification trees which are not phylogenetic trees. According to the statisticalmodel used to estimate the accuracy of alignment scores, the update of theinitial protein sequence database can imply a complete re-computation of themassive all-by-all pairwise alignment in order to update the clusters.

interacting with families of drug scaffolds. It can therefore be amilestone toward a structured definition of a referential proteinspace for phylogenomics and chemogenomics. We evaluatedthe possible use of the CSHP projection to map and navigateinto the protein space using current massive protein comparisonoutputs, and we propose guidelines for future generations ofmassive protein sequence comparison projects.

2. Virtual protein sequence databases

Most of the millions of virtual protein sequences deducedfrom DNA automatic sequencing, and the millions to come,

will not be experimentally confirmed, neither their functiondirectly analyzed. Electronic storage of genes is scatteredin specific databases (e.g. GDB for Human [10], TAIR forArabidopsis [11], PlasmoDB for Plasmodium [12], etc.) andgathered in three major generalist repositories (e.g. EMBL,at the European Bioinformatic Institute [13], GenBank atthe National Center for Biotechnology Information [14] andDDBJ at the DNA Data Bank of Japan [15]). Institutes incharge of these databases, aware of the risk of ending withan inextricable labyrinth with redundancy and inconsistencies,unified the storage of gene entries and exchange their datadaily [16]. Coexisting protein databases with differing coverageand priorities were recently unified. Briefly, the PIR–PSD [17]and Swiss–Prot databases [18] contain the best characterizedsequences, with detailed and manually curated annotationsof protein function; they represent therefore a reference thebiologists enrich at their own pace. Complementarily, TrEMBL(translation of EMBL nucleotide sequence database) consistsof computer-annotated entries derived from the translation ofall coding sequences in the nucleotide sequence databases,except for entries already referenced in Swiss–Prot [18]. Acentralized authoritative resource, UniProt, has been formed byuniting PIR–PSD, Swiss–Prot and TrEMBL databases [4]. Theproportion of protein sequences with high-quality functionalannotation in the complete virtual protein sequence set was∼13% in 2003, ∼10% in 2004 and lower than 8% in 2005.This proportion will fall below 1% in 2008. Given the cost andlow throughput of experimental protein characterization and ofmanual curation of sequence annotations, our exploration of>92% of the protein space (>99% in 3 years) relies on ourability to extrapolate the most accurate navigation maps in theprotein space.

3. Sequence alignment score measures, optimization meth-ods and statistics for a homology-based reduction of theprotein space diversity

Given a new protein sequence, the widely used methodto start with is a homology-based annotation transfer, i.e. thetransfer of portions of knowledge from related sequences storedin databases, on the basis of a suspected common evolutionaryorigin. When homologues are detected in distinct organisms,they are termed orthologues and when they are suspected toderive from gene duplication inside a given organism, theyare termed paralogues. Sequence homology is assessed usingdynamic programming techniques similar to those developedto calculate the edit distances between strings. In all cases,the underlying rationale for homology search at the sequencelevel relies on a fundamental postulate that can be stated as:“the closer in the evolution, the more alike and conversely,the more alike, probably the closer in the evolution”. Genomeannotations based on homology, performed by numerous teamswithout any standard workflow, ended with a mild resultsince ∼25% of the virtual protein sequences have no knownhomologues [5], a figure that reaches ∼60% in some organisms,such as Plasmodium falciparum [19]. Paucity of detectedhomologues may be either due to a real lack of homologues

Page 11: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

420 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427

or a technical failure (false negatives). Selection of scoringmeasures, sequence alignment methods and statistics for amost accurate homology-based reduction of the protein spacediversity should therefore be carefully examined before oneachieves a massive comparison.

Firstly, alignments are obtained by maximizing (orminimizing) a quantity named “score”, determined at the levelof aligned amino acids. To that purpose, scoring matricesare used to weight and sum scores of compared aminoacids and find optimal alignments, computed with a dynamicprogramming procedure. Scoring matrices have been foundto be similarity matrices as well [20,21]. Many scoringmatrices are available [22–25] and evaluation studies led tothe conclusion that those based on a log-odds ratio, likeBLOSUM [23], outperformed the others [26]. BLOSUM wascomputed using blocks of aligned sequences with:

s(i, j) =1λ

log(

qi j

pi p j

)where i and j are aligned amino acids, s(i, j) the score,qi j the frequency of the observation: “i is aligned with j”,i.e. the target frequency, pi and p j the background frequenciesof i and j respectively, and λ a scaling factor. We furtherdemonstrated [25] that the score was also related to the mutualinformation in the sense of Hartley [8,9], between the twoconsidered amino acids:

s(i, j) =1λ

I (i; j)

where I is the mutual information between events, i.e. thereduction of the uncertainty of event i due to the knowledgeof j . Massive comparisons using scoring matrices thus definedallow therefore the extraction of mutual information sharedby sequences. The postulate for homology-based annotationtransfer is therefore supported by information theory [8,9]and is re-formulated accordingly as: “Given two homologousproteins, the closer in the evolution, the greater their mutualinformation at the sequence level and conversely, the greater themutual information at the sequence level, probably the closerin the evolution”. Thus selection of the amino acid scoringmatrix can influence the quality of the information one cantransfer, and therefore the accuracy of the homology-basedreduction.

Secondly, few algorithms exist to optimize the protein se-quence alignments, being either global (at the full-length se-quence level) or local. Global alignment algorithms [27] arenot accurate to assess homology of domains in modular pro-teins [21]; local alignments are better suited [28–30]. The SWdynamic algorithm [28] is considered exact, but its computa-tional cost is too high for traditional computing resources, evenfor small samples of sequences. Heuristic approaches were suc-cessful in speeding up alignment processes, the most popularbeing the low CPU demanding BLASTP [30] and FASTA [29]algorithms. BLASTP proved to be efficient from routine com-parisons of small samples of proteins directly handled by benchbiologists, to batch comparisons of large numbers of sequencesmonitored by bioinformaticians. Rank in computation speed,

i.e. SW < FASTA < BLASTP, is considered the reverse to thatof accuracy. The BLASTP, FASTA and SW methods were suc-cessfully parallelized and Grid-enabled (e.g. [31–33]).

Thirdly, confidence in alignment results is estimatedaccording to statistical criteria. It is important to notethat an alignment algorithm comes with a statistical modelimplemented in the code, particularly in the BLAST package,and it is therefore difficult to discuss the current view onsequence comparison methods and statistics independently.Two major statistical models are currently used to testalignment scores. The most common test is an estimate ofthe E-value (short for Expectation value), i.e. the number ofalignments one expects to find in the database by chance,with equivalent or better scores. It can be determined fromthe complete distribution of scores. The BLASTP associatedstatistics defined by Karlin and Altschul [34] are based onthe probability of an observed local alignment score accordingto an extreme value distribution. The number of high-scoringsequence matching regions is estimated above a threshold by aPoisson distribution and allows the computation of a P-value,that the score could have occurred by chance, related to theE-value by the formula:

E = log(1/1 − P).

The E-values can therefore be computed based on thescore distribution or on the Karlin–Altschul model (seeTable 1). Validity of the Karlin–Altschul E-value computationmodel (and subsequent improvements) requires two restrictiveconditions: first, individual residue distributions for the twosequences should not be ‘too dissimilar’ and second, sequencelengths ‘should grow at roughly equal rates’ [34]. Validityrestrictions listed here are fully acceptable when dealingwith protein sequences of average lengths and amino aciddistribution, and BLASTP is a good compromise when havingaccess to limited CPU power. However, compositionallybiased genomes such as that of Plasmodium falciparum falloutside of the validity domain for a BLASTP comparisonwith unbiased sequences [35,36]. Based on a BLASTPsemi-automatic annotation procedure, around 60% of thePlasmodium sequences did not have any apparent homologywith sequences from other genomes [19] although suchproteomic uniqueness appeared doubtful. In addition, thedependence of the P-value calculation on the data bank sizeimplies that results may fluctuate through updating, which isnot compatible with the construction of a stable reference.

An alternative method to assess the relevance of analignment was introduced by Lipman and Pearson [37]. Ituses the Monte Carlo techniques to investigate the significanceof a given score calculated from the alignment of two realsequences a and b. It can be used to sort results obtainedby any comparison methods including BLASTP, althoughthis has not yet been achieved at a massive scale. It iscurrently used to estimate the probabilities of SW comparisons.The method consists in computing alignments of shuffledsequences from a and b [38]; the variables corresponding tothe shuffled sequences are termed a∗ and b∗ respectively. Thesecomparisons allow the estimate of an empirical mean score (µ)

Page 12: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 421

and standard deviation (σ ) from the distribution of the randomvariable S(a∗, b∗). The Z-value is then defined as:

Z =s(a, b) − µ

σ.

For the computation of a Z-value between two sequences,only one of the two sequences is generally shuffled [39]. Forpractical reasons, the Z-value was reformulated according toa conservative principle, that is the minimum of Z(a, b∗) andZ(a∗, b) [40]. Alternatively, Z-value can be estimated by anaverage of Z(a, b∗) and Z(a∗, b) (see Table 2). Computationof Z(a, b) is known to be convergent and depends on theaccuracy of the estimation of µ and σ , and therefore on thenumber of shuffling, ranging from 100 to 1000 [35,39,41]. Theasymptotic law of Z-value was shown to be independent ofsequence length and amino acid distribution [40]. The estimateof the Z-value was additionally dependent on the shufflingmethod because the shuffling procedure respects the sequencecomposition but breaks down biological structures [40]. Wedemonstrated the TULIP theorem (theorem of the upper limitof a score probability, [35]) assessing that Z-values can be usedas a statistical test, a single-linkage clustering criterion and that1/Z-value2 was an upper limit to the probability of an alignmentscore whatever the actual probability law was. From the TULIPtheorem and corollaries [7,35], the comparison of a protein toa given reference sequence a, weighed by an alignment score,is characterized by a bounded probability that the alignment isobtained by chance.

In practice, a Z-value table can be analyzed using theTULIP theorem to detect pairs of proteins that are probablyhomologues following a Z-value confidence cutoff. Forinstance, a Z-value above 10 allows an estimate that thealignment is significant with a statistical risk of 1/Z-value2,i.e. 0.01.

The TULIP theorem provides therefore sustained ar-guments in favor of the Lipman–Pearson model to es-timate an alignment probability for massive comparisons;in particular the two restrictions of the Karlin–Altschulmodel, i.e. that the amino acid distributions of com-pared sequences should not be too dissimilar and thattheir lengths should be relatively close, are not required.In addition, Z-values are completely independent of thealignment length and are therefore normalized values, andZ-value statistics are independent of the databank size andtherefore stable after each database update. Eventually, up-date is made easy since the simple collection of {new}-by-{old} and {new}-by-{new} is sufficient to complete theZ-value table obtained from an existing all-by-all comparison(Figs. 2 and 3).

Tables 1 and 2 shows a comparative record of protein all-by-all massive comparisons that have been performed to organize,map, and reduce the diversity of the protein space. Whenavailable, information on the computing resource (mostly gridsand supercomputers) was provided. Table 1 describes projects(COG [42–45], Tribe [46,47], ProtNet [48–50], ProtoMap [51–53], SIMAP [54,55], SYSTERS release 4 [56–58]) in whichalignment significance was assessed from E-value estimates

that are less CPU-demanding than Z-value computations ina single massive comparison experiment. The handling ofthe output n × n table of E-values requires pragmatic post-processing normalization, including asymmetric correctionsof E-values obtained after permutation of the two alignedsequences (e.g. Tribe), or consensus E-value computationafter alignment with different algorithms (e.g. ProtoNet). Inall cases, there is no theoretical support to justify that anE-value table can be converted into a rigorous and stablemetric. The E-value table can be converted into a Markovmatrix (e.g. Tribe, SIMAP), or a close graphic equivalent,i.e. graphs connecting protein entries with E-values as weightsfor graph edges (e.g. COG, ProtoMap, SYSTERS). The proteinsets are organized either by detecting graphs and sub-graphsfollowing pragmatic rules, with granularities depending onE-value thresholds, or by distance clustering using E-value asa pseudo-metrics, or by Markov-random-field clustering. Noneof the obtained organization of the protein sequences can benamed a spatial projection and none of the obtained clusterscan be represented as a phylogenetic reconstruction. Eventually,the economy of computing E-values in an all-by-all comparisonexperiment is not gained in the updating process that requires acomplex pipeline or a complete re-calculation.

Table 2 describes projects (Decrypthon [59], TeraProt [60],PhytoProt [61–63], CluSTr [64–67]) in which alignment sig-nificance was assessed from Z-value estimates. All compu-tations were undergone with SW. It is possible to computescores and Z-values with other alignment algorithms such asBLASTP, but it is very likely that this was not done sim-ply because E-value calculation is directly implemented inthe BLASTP code, and because Z-value computation withBLASTP is not yet coded. Handling of the output n × ntable does not require in-depth processing and is compli-ant with a straightforward updating process, i.e. {new}-by-{new} and {new}-by-{old} computations. Because Z-valuesare sufficient to assess an alignment significance [7,35], re-sults are provided either as raw lists one can sort followingZ-values (Decrypthon, TeraProt), or as clusters using Z-valuesas single linkage criterion (ClusTr) or more sophisticatedclusters obtained following pyramidal classification (Phyto-Prot). The ClusTr comparison provided a hierarchical proteinspace map for Integr8, the integrated functional and moleculardatabase/knowledgebase of the EBI [65–67]. Although Z-valuecomputation has a high initial CPU-cost, the statistic validityand easy update of the Z-value table are arguments support-ing that future massive comparisons using grid and supercom-puter power at best should get inspired by this first generationof projects listed in Table 2.

From this overview of alignment score measures, optimiza-tion methods and statistics, the methods of choice for a reduc-tion of the virtual proteome from an all-by-all sequence com-parison seem to include the use of log odds ratio-derived aminoacid scoring matrices (such as BLOSUM), the exact and CPUdemanding SW algorithm, and statistics based on Z-values.This workflow is summarized in Figs. 2 and 3.

Page 13: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

422 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427

Fig. 2. Massive all-by-all protein sequence comparison and protein space reduction workflow, based on Z-value statistics and protein projection in the CSHP. Froma protein sequence database, a massive comparison is carried out using an amino acid scoring matrix such as BLOSUM, the SW alignment method and the TULIPstatistical model based on Z-values (requiring Monte Carlo simulation). The output of the comparison is a table containing SW scores and Z-values that are used forthe reduction of the initial protein diversity at the sequence level, following the rigorous CSHP spatial projection. Obtained clusters can be viewed as phylogeneticTULIP trees.

4. The configuration space of homologue proteins (CSHP)and its phylogenetic topology, based on pair-wise Z-valueprobabilities

We introduced a representation of protein sequences usingpair-wise alignments in which Z-values are computed andstored [7]. In a set of n homologous proteins, any sequence acan be selected as a reference in respect to which the n − 1others are compared. Such a geometric representation ofobjects relative to a fixed frame is known as a configurationspace (CS), and named accordingly the configuration spaceof homologous proteins or CSHP. In a set of n homologousproteins, it is therefore possible to define n referencesfor the CSHP by permuting the sequences considered asreferential. An interesting property of the CSHP derives fromthe demonstration that sequence similarity computed for an

alignment is an expression of the mutual information sharedby protein sequences, as stated in the information theory [8,9]. The CSHP is conservative for mutual information, in theway physical spaces can be conservative for metrics, forcesor energy. Mutual information with a referential sequence a issufficient for the full positioning of the n − 1 homologues, andthe positioning of proteins that share some homology with thereference is unambiguous, unique, and unaltered when proteinsare added or removed.

In the CSHP, features separating objects, i.e. the sequences,can be used to compute a probabilistic expression ofproximity based on Z-values [7], out of which one candeduce a divergence time t , following assumptions derivedfrom an evolutionary model [68]. For instance, given aand b two homologous sequences, the simple molecularclock hypothesis [69] supposes that t is a measure of the

Page 14: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 423

Fig. 3. Updating process for a massive all-by-all protein sequence comparison based on Z-value statistics and protein projection in the CSHP. According to thestatistical model used to estimate the accuracy of alignment scores, the update of the initial protein sequence database {new} implies that a simple {new}-by-{new}

and {new}-by-{old} computation is sufficient in order to update the clusters.

transmutation of a and b as a consequence of a Poissonprocess. In the CSHP, an evolutionary distance is given by thedivergence observed between two sequences knowing that theyshare some features (the observed sequences a and b) and thatthey were identical before the speciation event (an unknownancestral sequence u). Considering b∗a shuffled sequence ofb, then the probability pid/a that b∗ shares identity with a,knowing that the proximity between b∗ and a is lower than thatbetween b and a, was shown to be bounded above according tothe following formula:

pid/a(b∗) ≤

(s(a,b)−µ1

σ1

)2

(s(a,a)−µ2

σ2

)2

where s(a, a) and s(a, b) are the pair-wise optimal alignmentscores of a with a and a with b, and µ1, and σ1, µ2 and σ2 arethe mean and the standard deviation of S(a, b∗) and S(a, a∗)

respectively. Using the Poisson correction, an expression oft (a, b) is given as the linear combination of the two correctionsof the evolutionary distances deduced from pid/a and pid/b:

t (a, b) = −[log(pid/a(b∗)) + log(pid/b(a∗))]

with a∗ and b∗ the random variables corresponding to theshuffled sequences of a and b respectively. The sum of thelogarithms corresponds to the product of the two probabilities,an expression of the hypothesis of independence of lineage.Thus, t (a, b) appears as a function of Z-value ratios.

For any set of homologous proteins, it is therefore possibleto measure a table of pair-wise divergence times and build

Page 15: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

424 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427

Fig. 4. Exploration of the protein space based on the TeraProt sequence massive comparison and protein projection in the CSHP. Given a UniProt protein entry(here, Q9SM44) we extracted sequences from the TeraProt output table that could be aligned with the query with a Z-value > 30. This set contains plant proteins(Q9FZL4, O81770, Q9S193, Q9FZL3, P93115) that have the same function as the query (MGD, i.e. catalyzing the transfer of a galactose from a UDP-galactosedonor onto a lipidic diacylglycerol acceptor) and bacterial proteins (Q8XMH5, Q99V75, Q8NXC3, YPFP BACSU, Q8XIB5, Q97F56, Q9KBH0) having a distinctbut related activity (MURG, i.e. catalyzing the transfer of an N-acetyl-glucosamine from a UDP-N-acetyl-glucosamine donor onto the Lipid 1 acceptor). (a) TULIPtree computed from the Z-value table obtained from the TeraProt project (Monte Carlo simulation with 100 sequence shuffling; amino acid scoring matrix: PAM10). This tree highlights the clear evolutionary divergence between plant and bacterial sequences. (b) TULIP tree obtained after a re-computation of the Z-valuetable with 100 sequence shuffling and the BLOSUM 62 amino acid scoring matrix. (c) TULIP tree obtained after a re-computation of the Z-value table with 2000sequence shuffling and the BLOSUM 62 amino acid scoring matrix. The TULIP tree constructed from the TeraProt database was obtained after <15 s using abench workstation (HP ProLiant G4 Bi-Xeon—2.8 GHz) whereas trees reconstructed with 100 shuffling required 6 min computation and reconstruction with 2000shuffling required less than 2 h. This example illustrates the power of the method and the benefit one can get by exploring massive comparison results using theCSHP projection and TULIP phylogenetic underlying topology.

phylogenetic trees using distance methods (see Fig. 2). Thesetrees were called TULIP trees. TULIP trees were comparedto phylogenetic trees built using conventional methods, forinstance the popular PHYLIP [70] or PUZZLE [71] methodsbased on multiple sequence alignments. TULIP trees provedto perform as well in any unbiased sets of proteins. Moreover,some phylogenetic inconsistencies in trees built with multiple-alignment based methods, particularly including subsets ofcompositionally biased sequences, or with low bootstrappingvalues, could be spectacularly solved with the TULIP tree[7,72].

An advantage of the phylogenetic inference from the CSHPover that obtained from multiple alignments lies precisely in theTULIP tree construction from pair-wise alignments. Whereasthe addition or removal of a sequence can deeply alter themultiple alignment result, and the deduced phylogeny, theZ-value and divergence time tables that serve to reconstruct thephylogenetic trees from the CSHP are the result of a MonteCarlo simulation, which is a converging process at the level ofthe pair-wise comparison and is not altered by database updates.As a result, whereas a phylogenetic database computed frommultiple alignments would require a complete and increasingcomputation for any update, the TULIP tree calculation simplyrequires the calculation of the {new}-by-{old} and {new}-by-{new} Z-values and divergence times (Fig. 3). The CSHP andits assigned Darwinian phylogenetic topology is therefore atheoretical frame of choice for the treatment of results ofmassive pairwise protein sequence comparisons.

5. CSHP projection and TULIP tree construction fromexisting massive comparison tables

Can the CSHP be used to obtain a spatial projection ofproteins and detect clusters using existing massive comparisontables? Can the Darwinian topology assigned to the CSHP beused to return phylogenetic trees (TULIP trees) from theseCSHP projections? We downloaded complete results of oneof the projects listed in Table 2, i.e. the complete release ofTeraProt, which output was made publicly available as lists,one can sort using alignment parameters including Z-values.Thus, given a protein entry, we can extract the complete listof sequences which alignment Z-value was above a definedcutoff. The subsequent projection of this set of sequences inthe CSHP requires the complete table of pairwise Z-values.Because of SW score cutoff used in the TeraProt workflow(see Table 2), these required Z-values may not have been allcollected, and the table may contain holes. A default Z-valueshould therefore be introduced to circumvent this missing data,or the corresponding pairs of sequences should be discarded asthe missing alignment may reflect a non-transitive homology.

Fig. 4a shows an example of a TULIP tree obtainedwith sequences that were aligned with a spinach protein(MGD1, UniProt reference Q9SM44) with a Z-value above 30,computed after 100 sequence shuffling. This set contains plantproteins that have the same function (MGD, i.e. catalyzing thetransfer of a galactose from a UDP-galactose donor onto alipidic diacylglycerol acceptor) and bacterial proteins having

Page 16: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 425

a distinct but related activity (MURG, i.e. catalyzing thetransfer of an N-acetyl-glucosamine from a UDP-N-acetyl-glucosamine donor onto the Lipid 1 acceptor). The obtainedcluster is consistent with the common origin of MGD andMURG sequences and their 3D structural similarity [72].The TULIP tree highlights the clear evolutionary divergencebetween these groups of sequences. Fig. 4b and c show theTULIP trees obtained after a re-computation of the Z-valuetable with 100 and 2000 sequence randomizations. Althoughthe accuracy of the tree is improved, the reliability of theinitial clusters and derived phylogeny shown in Fig. 4a, asestimated from the refined tree shown in Fig. 4b, illustrates thatthe information collected in the original massive comparisoncould be extracted with a good level of confidence. The TULIPtree constructed from the TeraProt database was obtained after<15 s using a bench workstation (HP ProLiant G4 Bi-Xeon—2.8 GHz) whereas trees reconstructed with 100 shufflingrequired 6 min computation and reconstruction with 2000shuffling required less than 2 h using the bench workstation.This example illustrates the power of the method and thebenefit one can get by exploring massive comparison resultsusing the CSHP projection and TULIP phylogenetic underlyingtopology.

Because previous massive comparisons based on Z-valuestatistics, such as TeraProt, were not specifically designedto be processed following a CSHP projection and TULIPphylogenetic reconstruction, we noticed that the outputZ-value table could not be fully subjected to this post-analyticaltreatment. On the one hand, multiple occurrences of someidentical sequences introduced redundant Z-values had to befiltered and, on the other hand, missing Z-values had to becorrected by default values. We are currently designing aninternet-portal allowing the exploration of protein space(s)based on curated massive comparison tables.

6. Conclusion

The proportion of protein sequences in public databases thatwill be of high-quality annotation will fall below 1% in less than3 years, and the total number of proteins in the virtual proteomewill soon exceed 10 million. The need for a standard theoreticaland practical projection of the protein space is urgent. Thereduction of the virtual proteome from an all-by-all sequencecomparison seem to include the use of odds ratio-derived aminoacid scoring matrices (such as BLOSUM), the exact and CPUdemanding SW algorithm, and statistics based on Z-values.One of our future goals is to compare the performance andaccuracy of BLASTP and SW based on Z-value statistics,which might be determinant in selecting the alignment methodfor future massive comparison projects. The use of Z-valuesis extremely important to embrace the diversity of the proteinworld, and particularly proteins that diverge from the averageamino acid distribution (compositionally biased [35,36]). TheCSHP is obtained by a rigorous projection of proteins,respecting biological constraints and being conservative formutual information shared by sequences [7]. It can be assigneda topology based on a protein evolution model that allows a

construction of phylogenetic trees, named TULIP trees [7].TULIP trees have numerous advantages over current clusteringmethods, including their stability and explicit Darwinianconstruction that makes sense to a bench biologist. The criticalneed of evolutionary consistent protein sequence navigationmaps and the lessons one can learn from the analysis of theprevious massive comparisons, presented in this paper, led usto investigate how the CSHP projection could help in providinga theoretically robust and practical solution. In this paper,we showed that the CSHP projection can be used to processthe outputs of current massive comparison projects (TeraProtresults can be explored at the TULIP 1.1 server that willsoon be launched, but the CSHP projection and TULIP treecluster representation can also be an important improvementfor other Z-value based systematic massive comparisonssuch as ClusTr) given minor corrections for uncollected lowZ-values in the database. The use of massive comparisonoutputs are not restricted to phylogenic explorations, and canhelp automatic massive annotations by homology transfers,which require other clustering methods, including speciesspecific occurrence profiles. The CSHP projection is thereforeproposed as a sequence-based underlying organization ofsequences that should be linked to other protein classification(particularly based on sub-domains and available tertiarystructures) and knowledge bases.

References

[1] M. Margulies, M. Egholm, W.E. Altman, S. Attiya, J.S. Bader, L.A.Bemben, J. Berka, M.S. Braverman, Y.J. Chen, Z. Chen, S.B. Dewell,L. Du, J.M. Fierro, X.V. Gomes, B.C. Godwin, W. He, S. Helgesen, C.H.Ho, G.P. Irzyk, S.C. Jando, M.L. Alenquer, T.P. Jarvie, K.B. Jirage, J.B.Kim, J.R. Knight, J.R. Lanza, J.H. Leamon, S.M. Lefkowitz, M. Lei, J. Li,K.L. Lohman, H. Lu, V.B. Makhijani, K.E. McDade, M.P. McKenna, E.W.Myers, E. Nickerson, J.R. Nobile, R. Plant, B.P. Puc, M.T. Ronan, G.T.Roth, G.J. Sarkis, J.F. Simons, J.W. Simpson, M. Srinivasan, K.R. Tartaro,A. Tomasz, K.A. Vogt, G.A. Volkmer, S.H. Wang, Y. Wang, M.P. Weiner,P. Yu, R.F. Begley, J.M. Rothberg, Genome sequencing in microfabricatedhigh-density picolitre reactors, Nature 437 (7057) (2005) 376–380.

[2] An example of an in silico docking can be accessed at http://wisdom.eu-egee.fr/.

[3] H.J. Feldman, M. Dumontier, S. Ling, N. Haider, C.W. Hogue, CO: Achemical ontology for identification of functional groups and semanticcomparison of small molecules, FEBS Lett. 579 (21) (2005) 4685–4691.

[4] A. Bairoch, R. Apweiler, C.H. Wu, W.C. Barker, B. Boeckmann, S. Ferro,E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M.J. Martin, D.A. Natale,C. O’Donovan, N. Redaschi, L.S. Yeh, The universal protein resource(UniProt), Nucleic Acids Res. 33 (Database issue) (2005) D154–D159.

[5] Y. Ofran, M. Punta, R. Schneider, B. Rost, Beyond annotation transferby homology: Novel protein-function prediction methods to assist drugdiscovery, Drug Discov. Today 10 (21) (2005) 1475–1482.

[6] J. Liu, B. Rost, Domains, motifs and clusters in the protein universe, Curr.Opin. Chem. Biol. 7 (1) (2003) 5–11.

[7] O. Bastien, P. Ortet, S. Roy, E. Marechal, A configuration space ofhomologous proteins conserving mutual information and allowing aphylogeny inference based on pair-wise Z-score probabilities, BMCBioinformatics 6 (1) (2005) 49.

[8] R.V.L. Hartley, Transmission of information, The Bell Syst. Tec. J. 3(1928) 535–564.

[9] C.E. Shannon, A mathematical theory of communication, The Bell Syst.Tec. J. 27 (1948) 379–423.

[10] Access to the genome database for Human at http://gdbwww.gdb.org/.

Page 17: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

426 O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427

[11] Access to the genome database for Arabidopsis athttp://www.arabidopsis.org/.

[12] Access to the genome database for Plasmodium athttp://www.plasmodb.org/.

[13] EMBL genome sequence repository at the European BioinformaticInstitute web service: http://www.ebi.ac.uk/embl/.

[14] GenBank genome sequence repository at the National Center for Biotech-nology Information web service: http://www.ncbi.nlm.nih.gov/entrez/.

[15] DDBJ genome sequence repository at the DNA Data Bank of Japan:http://www.ddbj.nig.ac.jp/.

[16] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, D.L. Wheeler,GenBank, Nucleic Acids Res. 33 (Database issue) (2005) D34–D38.

[17] C.H. Wu, L.S. Yeh, H. Huang, L. Arminski, J. Castro-Alvear, Y. Chen,Z. Hu, P. Kourtesis, R.S. Ledley, B.E. Suzek, C.R. Vinayaka, J. Zhang,W.C. Barker, The protein information resource, Nucleic Acids Res. 31 (1)(2003) 345–347.

[18] B. Boeckmann, A. Bairoch, R. Apweiler, M.C. Blatter, A. Estreicher,E. Gasteiger, M.J. Martin, K. Michoud, C. O’Donovan, I. Phan, S.Pilbout, M. Schneider, The SWISS-PROT protein knowledgebase and itssupplement TrEMBL in 2003, Nucleic Acids Res. 31 (1) (2003) 365–370.

[19] M.J. Gardner, N. Hall, E. Fung, O. White, M. Berriman, R.W. Hyman,J.M. Carlton, A. Pain, K.E. Nelson, S. Bowman, I.T. Paulsen, K. James,J.A. Eisen, K. Rutherford, S.L. Salzberg, A. Craig, S. Kyes, M.S. Chan,V. Nene, S.J. Shallom, B. Suh, J. Peterson, S. Angiuoli, M. Pertea,J. Allen, J. Selengut, D. Haft, M.W. Mather, A.B. Vaidya, D.M. Martin,A.H. Fairlamb, M.J. Fraunholz, D.S. Roos, S.A. Ralph, G.I. McFadden,L.M. Cummings, G.M. Subramanian, C. Mungall, J.C. Venter, D.J.Carucci, S.L. Hoffman, C. Newbold, R.W. Davis, C.M. Fraser, B. Barrell,Genome sequence of the human malaria parasite plasmodium falciparum,Nature 419 (6906) (2002) 498–511.

[20] M.S. Waterman, Introduction to Computational Biology: Maps, Se-quences, and Genomes, CRC Press, 1995.

[21] J. Setubal, J. Meidanis, Introduction to Computational Molecular Biology,PWS Publishing Company, Boston, 1997.

[22] M.O. Dayhoff, W.C. Barker, L.T. Hunt, Establishing homologies inprotein sequences, Method Enzymol. 91 (1983) 524–545.

[23] S. Henikoff, J.G. Henikoff, Amino acid substitution matrices from proteinblocks, Proc. Natl Acad. Sci. USA 89 (22) (1992) 10915–10919.

[24] J.L. Risler, M.O. Delorme, H. Delacroix, A. Henaut, Amino acidsubstitutions in structurally related proteins. A pattern recognitionapproach. Determination of a new and efficient scoring matrix, J. Mol.Biol. 204 (1988) 1019–1029.

[25] O. Bastien, S. Roy, E. Marechal, Construction of non-symmetricsubstitution matrices derived from proteomes with biased amino aciddistributions, C. R. Biol. 328 (5) (2005) 445–453.

[26] S. Henikoff, J.G. Henikoff, Performance evaluation of amino acidsubstitution matrices, Proteins 17 (1) (1993) 49–61.

[27] S.B. Needleman, C.D. Wunsch, A general method applicable to the searchfor similarities in the amino acid sequence of two proteins, J. Mol. Biol.48 (3) (1970) 443–453.

[28] T.F. Smith, M.S. Waterman, Identification of common molecularsubsequences, J. Mol. Biol. 147 (1) (1981) 195–197.

[29] W.R. Pearson, D.J. Lipman, Improved tools for biological sequencecomparison, Proc. Natl Acad. Sci. USA 85 (8) (1988) 2444–2448.

[30] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic localalignment search tool, J. Mol. Biol. 215 (3) (1990) 403–410.

[31] E. Glemet, J.J. Codani, LASSAP, a LArge Scale Sequence compArisonPackage, Comput. Appl. Biosci. 13 (2) (1997) 137–143.

[32] T. Rognes, E. Seeberg, Six-fold speed-up of Smith–Waterman sequencedatabase searches using parallel processing on common microprocessors,Bioinformatics 16 (8) (2000) 699–706.

[33] A. YarKhan, J. Dongarra, Biological sequence alignment on thecomputational grid using the GrADS framework, Future Gener. Comput.Syst. 21 (6) (2005) 980–986.

[34] S. Karlin, S.F. Altschul, Methods for assessing the statistical significanceof molecular sequence features by using general scoring schemes, Proc.Natl Acad. Sci. USA 87 (6) (1990) 2264–2268.

[35] O. Bastien, J.C. Aude, S. Roy, E. Marechal, Fundamentals ofmassive automatic pairwise alignments of protein sequences: Theoreticalsignificance of Z-value statistics, Bioinformatics 20 (4) (2004) 534–537.

[36] O. Bastien, S. Lespinats, S. Roy, K. Metayer, B. Fertil, J.J. Codani,E. Marechal, Analysis of the compositional biases in plasmodiumfalciparum genome and proteome using Arabidopsis thaliana as areference, Gene 336 (2) (2004) 163–173.

[37] D.J. Lipman, W.R. Pearson, Rapid and sensitive protein similaritysearches, Science 227 (4693) (1985) 1435–1441.

[38] W.M. Fitch, Random sequences, J. Mol. Biol. 163 (2) (1983) 171–176.[39] J.P. Comet, J.C. Aude, E. Glemet, J.L. Risler, A. Henaut, P.P. Slonimski,

J.J. Codani, Significance of Z-value statistics of Smith–Waterman scoresfor protein alignments, Comput Chem. 23 (3–4) (1999) 317–331.

[40] J.N. Bacro, J.P. Comet, Sequence alignment: An approximation law forthe Z-value with applications to databank scanning, Comput Chem. 25 (4)(2001) 401–410.

[41] J.C. Aude, A. Louis, An incremental algorithm for Z-value computations,Comput Chem. 26 (5) (2002) 403–411.

[42] R.L. Tatusov, E.V. Koonin, D.J. Lipman, A genomic perspective onprotein families, Science 278 (5338) (1997) 631–637.

[43] R.L. Tatusov, D.A. Natale, I.V. Garkavtsev, T.A. Tatusova, U.T.Shankavaram, B.S. Rao, B. Kiryutin, M.Y. Galperin, N.D. Fedorova,E.V. Koonin, The COG database: New developments in phylogeneticclassification of proteins from complete genomes, Nucleic Acids Res. 29(1) (2001) 22–28.

[44] R.L. Tatusov, N.D. Fedorova, J.D. Jackson, A.R. Jacobs, B. Kiryutin, E.V.Koonin, D.M. Krylov, R. Mazumder, S.L. Mekhedov, A.N. Nikolskaya,B.S. Rao, S. Smirnov, A.V. Sverdlov, S. Vasudevan, Y.I. Wolf, J.J. Yin,D.A. Natale, The COG database: An updated version includes eukaryotes,BMC Bioinformatics 4 (2003) 41.

[45] Access to COG protein clusters at http://www.ncbi.nlm.nih.gov/COG/.[46] A.J. Enright, V. Kunin, C.A. Ouzounis, Protein families and TRIBES in

genome sequence space, Nucleic Acids Res. 31 (15) (2003) 4632–4638.[47] TribeMCL downloadable at http://www.ebi.ac.uk/research/cgg/tribe/.[48] O. Sasson, A. Vaaknin, H. Fleischer, E. Portugaly, Y. Bilu, N. Linial,

M. Linial, ProtoNet: Hierarchical classification of the protein space,Nucleic Acids Res. 31 (1) (2003) 348–352.

[49] N. Kaplan, O. Sasson, U. Inbar, M. Friedlich, M. Fromer, H. Fleischer,E. Portugaly, N. Linial, M. Linial, ProtoNet 4.0: A hierarchicalclassification of one million protein sequences, Nucleic Acids Res. 33(Database issue) (2005) D216–D218.

[50] Access to ProtoNet massive comparison output athttp://www.protonet.cs.huji.ac.il/.

[51] G. Yona, N. Linial, M. Linial, ProtoMap: Automatic classification ofprotein sequences and hierarchy of protein families, Nucleic Acids Res.28 (1) (2000) 49–55.

[52] G. Yona, N. Linial, M. Linial, ProtoMap: Automatic classification ofprotein sequences, a hierarchy of protein families, and local maps of theprotein space, Proteins 37 (3) (1999) 360–378.

[53] Access to ProtoMap protein clusters at http://protomap.cornell.edu/.[54] R. Arnold, T. Rattei, P. Tischler, M.D. Truong, V. Stumpflen, W. Mewes,

SIMAP—The similarity matrix of proteins, Bioinformatics 21 (Suppl. 2)(2005) ii42–ii46.

[55] Access to SIMAP massive comparison out-put at http://mips.gsf.de/services/analysis/simap/ andhttp://webclu.bio.wzw.tum.de/cgi-bin/simap/start.pl.

[56] A. Krause, J. Stoye, M. Vingron, Large scale hierarchical clustering ofprotein sequences, BMC Bioinformatics 6 (1) (2005) 15.

[57] A. Krause, M. Vingron, A set-theoretic approach to database searchingand clustering, Bioinformatics 14 (5) (1998) 430–438.

[58] Access to SYSTERS V4 protein clusters at http://systers.molgen.mpg.de/.[59] Access to Decrypthon massive comparison output at

http://www.infobiogen.fr/services/decrypthon/decrypthon gb.htm.[60] Access to TeraProt massive comparison output at

http://www.infobiogen.fr/services/Teraprot/.[61] J.C. Aude, Y. Diaz-Lazcoz, J.J. Codani, J.L. Risler, Applications of the

pyramidal clustering method to biological objects, Comput Chem. 23(3–4) (1999) 303–315.

Page 18: The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

O. Bastien et al. / Future Generation Computer Systems 23 (2007) 410–427 427

[62] A. Louis, E. Ollivier, J.C. Aude, J.L. Risler, Massive sequencecomparisons as a help in annotating genomic sequences, Genome Res.11 (1) (2001) 1296–1303.

[63] Access to PhytoProt protein clusters at http://genoplante-info.infobiogen.fr/phytoprot/.

[64] R. Apweiler, M. Biswas, W. Fleischmann, A. Kanapin, Y. Karavi-dopoulou, P. Kersey, E.V. Kriventseva, V. Mittard, N. Mulder, I. Phan,E. Zdobnov, Proteome analysis database: Online application of InterProand CluSTr for the functional classification of proteins in whole genomes,Nucleic Acids Res. 29 (1) (2001) 44–48.

[65] E.V. Kriventseva, F. Servant, R. Apweiler, Improvements to CluSTr: Thedatabase of SWISS-PROT+TrEMBL protein clusters, Nucleic Acids Res.31 (1) (2003) 388–389.

[66] R. Petryszak, E. Kretschmann, D. Wieser, R. Apweiler, The predictivepower of the CluSTr database, Bioinformatics 21 (18) (2005)3604–3609.

[67] Access to CluSTr protein clusters at http://www.ebi.ac.uk/clustr/.[68] L. Brocchieri, Phylogenetic inferences from molecular sequences: Review

and critique, Theor. Popul. Biol. 59 (1) (2001) 27–40.[69] E. Zuckerkandl, The evolution of hemoglobin, Sci. Am. 212 (1965)

110–118.[70] J. Felsenstein, PHYLIP—Phylogeny inference package (Version 3.2),

Cladistics 5 (1989) 164–166.[71] H.A. Schmidt, K. Strimmer, M. Vingron, A. von Haeseler, TREE-

PUZZLE: Maximum likelihood phylogenetic analysis usingquartets and parallel computing, Bioinformatics 18 (3) (2002)502–504.

[72] C. Botte, C. Jeanneau, L. Snajdrova, O. Bastien, A. Imberty, C. Breton,E. Marechal, Molecular modeling and site-directed mutagenesis of plantchloroplast monogalactosyldiacylglycerol synthase reveal critical residuesfor activity, J. Biol. Chem. 280 (41) (2005) 34691–34701.

Olivier Bastien received two B.S. degrees inPopulation Biology (1999) and Cell Biology andPhysiology (2001) from the University of Grenoble,France and a M.S. in Biomathematics (2002) fromthe University of Paris VI. He received a Ph.D.in Biomathematics from the University of Grenoble(2006) in the Plant Physiology Laboratory, at theCommissariat a l’Energie Atomique, Grenoble, and atGene-IT (GenomeQuest), Rueil Malmaison, France.

He coauthored several papers in the area of biomathematics, biology andbioinformatics. His Ph.D. project focuses on the development of new numericalapproaches for the comparison of compositionally biased sequences in thefield of malaria genomics, using parallelized algorithms. He designed massiveprotein sequence comparison experiments, carried out using the CEA TERAsupercomputer, Bruyere-Le-Chatel. He designed the configuration space ofhomologous proteins using concepts from information theory and theoreticalphysics. He developed statistical methods to assist the analysis of bi-dimensional protein electrophoresis (proteomics) and software for capillaryelectrophoresis. Current research interests include genome annotation theories,phylogenomics, automatic homology prediction, statistics for sequenceanalyses, and epigenetic models.

Philppe Ortet received his M.S. (1995) in computerengineering from the University of Paris VI. In 1997,he joined the Commissariat a l’Energie Atomique inthe technical and scientific information departmentand, in 2001, the Department of Plant Ecophysiologyand Microbiology, Cadarache, France, where hewas in charge of information systems, algorithmicand developments for local teams of biologists. Hecoauthored several papers in the area of integrated

biology and bioinformatics. He developed tools to assist the analysis ofmicroarrays (DNA chips). He recently designed the bioinformatic workflow andtools (GenoBrowser) for the complete annotation of bacterial genomes. Currentresearch interests include complete genome annotation, phylogenomics,automatic homology prediction and sequence analyses.

Sylvaine Roy received a M.S. (1984) in Biochemical En-gineering from the Institut National des Sciences Ap-pliquees (Engineering University), Lyon, France and aM.S. (1986) in Computer Science from Ecole NationaleSuperieure d’Informatique et de Mathematiques Ap-pliquees, Grenoble, France. She worked (1987–1991) inthe research and development software laboratory of theHewlett Packard Company, Grenoble, France. In 1991,she joined the Commissariat a l’Energie Atomique in the

Life Science Division, Grenoble, France, where she was in charge of net-work security, information systems, algorithmic and software developments forlocal teams of biologists. She coauthored several papers in the area of biol-ogy and bioinformatics. She recently managed software development projectsfor the analysis and integration of automatic molecular screening data. Hercurrent research interests include chemogenomics, artificial intelligence,machine-learning, data semantic and data integration.

Eric Marechal is currently in charge of the Compar-ative Chemogenomics team in the Plant PhysiologyLaboratory at the Commissariat a l’Energie Atomique,Grenoble, France. He received an Agregation teachingdegree in Life Science (1990) from the Ministry of Na-tional Education, a M.S. (1991) in Cell Biology fromthe Ecole Normale Superieure de Lyon, France, and aPh.D. (1994) in Molecular and Cell Biology from theUniversity of Grenoble, France. From 1994 to 1997

he joined the Nam-Hai Chua Laboratory at the Rockefeller University, NewYork, USA, as a Human Frontier post-doctoral fellow. After a short period atRhone–Poulenc Industrialisation, Lyon, France, he joined the CNRS in 1998.Since 1998, he has focused on the plant features discovered in apicomplexanparasites (a group of species including the malaria parasite Plasmodium fal-ciparum), taking this opportunity to start target and drug discovery projects,in collaboration with the Cerep and Gene-IT Companies, Rueil Malmaison,France. He coauthored several papers in biochemistry, cell biology, plant physi-ology, toxoplasmosis and malaria research, biomathematics, bioinformatics andchemogenomics. His scientific interests include chemogenomics, from phar-macological high-throughput screening to drug candidate development, com-parative genomics of plant and apicomplexans and phylogenomics, from high-throughput protein phylogeny to comparative physiology.