
Statistics & Probability Letters 66 (2004) 237–250

On random splitting of the interval

Javiera Barrera^a, Thierry Huillet^{b,*}

^a Depto. Ingeniería Matemática and Centro Modelamiento Matemático, UMR 2071 UCHILE-CNRS, Casilla 170-3 Correo 3, Santiago, Chile

^b Laboratoire de Physique Théorique et Modélisation, CNRS-UMR 8089 et Université de Cergy-Pontoise, 5 mail Gay-Lussac, Neuville sur Oise 95031, France

Received July 2003

Abstract

We study the colonizing process of space by populations, which can be described verbally as follows: suppose a first incoming species occupies a random fraction of the available unit space. Each forthcoming species takes an independent random fraction of the remaining space. There are n species, so a fraction of space is occupied by no species. This model constitutes an approximation to the celebrated GEM interval partition.

Essentially using moments, we study some statistical features of the induced partition structure of space.
© 2003 Elsevier B.V. All rights reserved.

MSC: primary 60G57; 62E17; secondary 60K99; 62E15; 62E20

Keywords: Random splitting of the interval; Ranking; Size-biased sampling; Stick breaking; Partition structure

1. Introduction: the broken stick model

Let us here introduce the random splitting model of the interval into n fragments and some of its relatives (see Feller, 1971, p. 25).

Consider an interval I of unit length, $|I| = 1$. Let $U_1, \ldots, U_n$ be independent and identically distributed (iid) and uniform. With $\bar U_m := 1 - U_m$, let
$$S_m = \prod_{k=1}^{m-1} \bar U_k \cdot U_m, \quad m = 1, \ldots, n \tag{1}$$

* Corresponding author. E-mail addresses: [email protected] (J. Barrera), [email protected] (T. Huillet).

0167-7152/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.spl.2003.10.015


be the consecutive spacings, i.e. the fractions of space occupied by each species in a sequential colonizing process involving n species (with the convention $S_1 = U_1$). Then

$$S_{n+1} := 1 - \sum_{m=1}^{n} S_m = \prod_{m=1}^{n} \bar U_m \tag{2}$$
is the free (or vacant) space, i.e. the fraction of space not occupied by any species. We shall write $S_n := (S_m,\ m = 1, \ldots, n+1)$.

From this model, we have the stochastic domination properties $S_1 \stackrel{d}{\succ} \cdots \stackrel{d}{\succ} S_n \stackrel{d}{=} S_{n+1}$ (where $S_1 \stackrel{d}{\succ} S_2$ in the usual sense that $P(S_1 > s) \geq P(S_2 > s)$ for all $s \in [0,1]$).

The generalized Engen–McCloskey (or GEM) limit: as $n \uparrow \infty$, $S_{n+1} \stackrel{a.s.}{\to} 0$ and this model converges weakly to the GEM(1)-model for which, with $(U_m,\ m \geq 1)$ iid and uniform,

$$\xi_m = \prod_{k=1}^{m-1} \bar U_k \cdot U_m, \quad m \geq 1 \tag{3}$$
constitutes a countable partition of the unit interval I, in the residual allocation model (RAM) class. This model has many remarkable invariance properties, among which:

(1) The GEM(1)-model is invariant under size-biased permutation (Sibuya and Yamato, 1995; Pitman, 1996; Yamato et al., 2001).

(2) The ranked values of $(\xi_m,\ m\geq 1)$, say $(\xi^\downarrow_m,\ m\geq 1)$, with $\xi^\downarrow_1 \geq \cdots \geq \xi^\downarrow_m \geq \cdots$, have Poisson–Dirichlet distribution PD(1), and a size-biased sampling of $(\xi^\downarrow_m,\ m\geq 1)$ has GEM(1) distribution (Pitman and Yor, 1997), i.e. is representable as a RAM as in (3). Furthermore, $1/\xi^\downarrow_1$ has Dickman's distribution (see Holst, 2001) and $e^m \xi^\downarrow_m \to 1$ almost surely as $m \uparrow \infty$ (Kingman, 1993, p. 96).

(3) The PD(1) distribution is invariant under the insert-and-delete (Pitman and Yor, 1997) and the split-and-merge transformations (Pitman, 2002).

For biological motivations and further properties of this model with infinitely many species, see Tavaré and Ewens (1997) and the bibliography therein.

In the sequel, we shall rather study some aspects of the model $S_n$ defined by (1) and (2), together with some questions related to ranking and size-biased sampling of it. Model (1)–(2) constitutes an approximation of model (3) and should be regarded as such.

In Holst (2001), Kingman (1993, pp. 93–94), Patil and Taillie (1977) and Donnelly (1986), another related partition $\tilde S_n$ of I into n+1 fragments was considered: in this partition model, the components of $\tilde S_n$ are distributed according to the (exchangeable) Dirichlet$_{n+1}(\theta)$ density function on the simplex, that is to say
$$f_{\tilde S_1,\ldots,\tilde S_{n+1}}(s_1,\ldots,s_{n+1}) = \frac{\Gamma((n+1)\theta)}{\Gamma(\theta)^{n+1}} \prod_{m=1}^{n+1} s_m^{\theta-1} \cdot \delta\left(\sum_{m=1}^{n+1} s_m - 1\right). \tag{4}$$

In this case, $\tilde S_m \stackrel{d}{=} \tilde S_{(n)}$, $m = 1,\ldots,n+1$, independently of m, and all components of $\tilde S_n$ are thus identically distributed (id). Their common density on the interval I is now given by
$$f_{\tilde S_{(n)}}(s) = \frac{\Gamma((n+1)\theta)}{\Gamma(\theta)\,\Gamma(n\theta)}\, s^{\theta-1} (1-s)^{n\theta-1},$$


a beta$(\theta, n\theta)$ density. Consider next the sequence $\tilde S^\downarrow_n := (\tilde S^\downarrow_{m,n},\ m=1,\ldots,n+1)$ obtained by ranking the spacings' vector $\tilde S_n$ in descending order, hence with $\tilde S^\downarrow_{1,n} \geq \cdots \geq \tilde S^\downarrow_{m,n} \geq \cdots \geq \tilde S^\downarrow_{n+1,n}$. Then $(\tilde S^\downarrow_{m,n},\ m=1,\ldots,n+1)$ converges in law to $(\xi^\downarrow_m,\ m\geq 1)$ in the Kingman limit: $n \uparrow \infty$, $\theta \downarrow 0$ while $n\theta = 1$. Note from (4) that $(\tilde S_m,\ m=1,\ldots,n+1)$ itself does not converge, accordingly.
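Model (1)–(2) is straightforward to simulate by sequential stick breaking. The following Python sketch (an illustration added here, not part of the original development) samples the spacings and checks numerically that they sum to 1 and that $E S_m = 2^{-m}$, the mean value stated in Section 2.3.

```python
import random

def broken_stick(n, rng):
    """Sample (S_1, ..., S_n, S_{n+1}) from model (1)-(2): each incoming
    species takes a uniform fraction of the space left over."""
    spacings, remaining = [], 1.0
    for _ in range(n):
        u = rng.random()
        spacings.append(remaining * u)   # S_m = U_m * prod_{k<m} (1 - U_k)
        remaining *= 1.0 - u
    spacings.append(remaining)           # S_{n+1}: vacant space, eq. (2)
    return spacings

rng = random.Random(1)
n, trials = 5, 200_000
means = [0.0] * (n + 1)
for _ in range(trials):
    for m, s in enumerate(broken_stick(n, rng)):
        means[m] += s / trials

assert abs(sum(broken_stick(n, rng)) - 1.0) < 1e-12      # partition of [0,1]
for m in range(n):
    assert abs(means[m] - 2.0 ** -(m + 1)) < 5e-3        # E[S_m] = 2^{-m}
assert abs(means[n] - 2.0 ** -n) < 5e-3                  # E[S_{n+1}] = 2^{-n}
```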

2. Random splitting: a statistical study

In this section, we study some statistical aspects of the model specified by (1) and (2). We shall make some use of moments, which, in the context of partitioning of the interval, is quite standard (see Johnson and Kotz, 1995).

2.1. The joint distribution of $S_m$, $m = 1,\ldots,n$ and their geometrical average

The random vector of spacings $(S_m,\ m = 1,\ldots,n)$ is not exchangeable. It has density
$$f_{S_1,\ldots,S_n}(s_1,\ldots,s_n) = \prod_{m=1}^{n-1} \frac{1}{1 - \sum_{k=1}^{m} s_k}\; I\left(\sum_{m=1}^{n} s_m < 1\right). \tag{5}$$

Alternatively,

Lemma 1. The joint moment function of $S_m$, $m = 1,\ldots,n$ is
$$E\left[\prod_{m=1}^{n} S_m^{q_m}\right] = \frac{1}{\Gamma(1 + q_1 + \cdots + q_n)} \prod_{m=1}^{n} \frac{\Gamma(1+q_m)}{1 + q_m + \cdots + q_n}. \tag{6}$$

Proof. We have
$$E\left[\prod_{m=1}^{n} S_m^{q_m}\right] = E\left(\prod_{m=1}^{n} \prod_{k=1}^{m-1} \bar U_k^{q_m} U_m^{q_m}\right) = \prod_{m=1}^{n} E\left[U_m^{q_m}\, \bar U_m^{q_{m+1}+\cdots+q_n}\right]$$
$$= \prod_{m=1}^{n} \frac{\Gamma(1+q_m)\,\Gamma(1+q_{m+1}+\cdots+q_n)}{\Gamma(2+q_m+\cdots+q_n)} = \frac{1}{\Gamma(1+q_1+\cdots+q_n)} \prod_{m=1}^{n} \frac{\Gamma(1+q_m)}{1+q_m+\cdots+q_n}.$$
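As a sanity check (ours, not in the paper), formula (6) can be compared against a Monte Carlo estimate of the joint moment, here for $n = 3$ and $(q_1, q_2, q_3) = (1, 2, 1)$:

```python
import math, random

def rhs(q):
    """Right-hand side of the joint moment formula (6)."""
    val = 1.0 / math.gamma(1 + sum(q))
    for m in range(len(q)):
        val *= math.gamma(1 + q[m]) / (1 + sum(q[m:]))
    return val

q = (1, 2, 1)
rng = random.Random(11)
trials, acc = 400_000, 0.0
for _ in range(trials):
    remaining, prod = 1.0, 1.0
    for qm in q:                          # sequential stick breaking, eq. (1)
        u = rng.random()
        prod *= (remaining * u) ** qm     # accumulate S_m^{q_m}
        remaining *= 1.0 - u
    acc += prod / trials
assert abs(acc - rhs(q)) < 5e-5           # MC estimate matches (6)
```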

This result allows us to obtain some statistical insight into the geometrical average of the $S_m$, $m = 1,\ldots,n$ and its behavior as n increases. Indeed, putting $q_m = q/n$, $m = 1,\ldots,n$, we obtain the moment function ($q > -1$) of $\prod_{m=1}^{n} S_m^{1/n}$ as
$$E\left[\left(\prod_{m=1}^{n} S_m^{1/n}\right)^q\right] = \frac{\Gamma(1+q/n)^n}{\Gamma(1+q)} \prod_{m=1}^{n} \frac{1}{1+mq/n} = \frac{\Gamma(1+q/n)^n}{\Gamma(1+q)} \cdot \frac{(n/q)^n\,\Gamma(1+n/q)}{\Gamma(1+n+n/q)} \quad \text{if } q > 0. \tag{7}$$


From Stirling's formula, with C the Euler constant,
$$E\left[\left(\prod_{m=1}^{n} S_m^{1/n}\right)^q\right] \underset{n\uparrow\infty}{\sim} \frac{e^{-Cq}}{\Gamma(1+q)(1+q)^{1/2}} \left(e\,(1+q)^{-(1+1/q)}\right)^n, \quad q > 0.$$

Furthermore, if $q > -1$, we have
$$-\frac{1}{n} \log\left[\prod_{m=1}^{n} \frac{1}{1+mq/n}\right] \underset{n\uparrow\infty}{\to} \int_0^1 \log(1+xq)\,dx = (1+1/q)\log(1+q) - 1.$$

So, from (7), with $q > -1$, we have the pointwise convergence
$$-\frac{1}{n} \log E\left[\left(\prod_{m=1}^{n} S_m^{1/n}\right)^q\right] \underset{n\uparrow\infty}{\to} F(q) := (1+1/q)\log(1+q) - 1,$$
with $F(0) = 0$ and $F'(q) \to 1/2$ as $|q| \to 0$. Standard results on large deviations yield

Proposition 2. Almost surely
$$\left(\prod_{m=1}^{n} S_m^{1/n}\right)^{1/n} \underset{n\uparrow\infty}{\to} e^{-F'(0)} = e^{-1/2} \simeq 0.6065 \tag{8}$$
and
$$\frac{1}{n} \log P\left(\prod_{m=1}^{n} S_m^{1/n} \geq e^{-nx}\right) \underset{n\uparrow\infty}{\to} f(x), \quad 0 < x \leq \tfrac{1}{2},$$
$$\frac{1}{n} \log P\left(\prod_{m=1}^{n} S_m^{1/n} \leq e^{-nx}\right) \underset{n\uparrow\infty}{\to} f(x), \quad x \geq \tfrac{1}{2},$$
where $f(x) = \inf_{q>-1}(qx - F(q)) \leq 0$ is the concave Legendre transform of $F(q)$, with $f(\tfrac{1}{2}) = 0$. (Note that $\prod_{m=1}^{n} S_m^{1/n} \leq \frac{1}{n}\sum_{m=1}^{n} S_m \leq 1/n$ by the AM–GM inequality, so the geometric average itself tends to 0; it is its n-th root that stabilizes.)
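Numerically (our check, not in the paper), the constant $e^{-1/2} \simeq 0.6065$ can be observed on $\left(\prod_{m} S_m^{1/n}\right)^{1/n} = \exp\big(n^{-2}\sum_m \log S_m\big)$, the normalization under which the geometric average stabilizes:

```python
import math, random

# Single-run sketch: V_n := -(1/n^2) sum_m log S_m concentrates at 1/2,
# so exp(-V_n) should be close to e^{-1/2} ~ 0.6065 for large n.
rng = random.Random(7)
n = 10_000
log_sum, remaining_log = 0.0, 0.0
for _ in range(n):
    u = rng.random()
    log_sum += remaining_log + math.log(u)   # log S_m = sum_{k<m} log(1-U_k) + log U_m
    remaining_log += math.log1p(-u)          # running log of the vacant fraction
limit = math.exp(log_sum / n**2)             # (prod_m S_m^{1/n})^{1/n}
assert abs(limit - math.exp(-0.5)) < 0.03
```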

Finally, the following result concerning the geometrical average of the $S_m$, $m = 1,\ldots,n$ holds.

Lemma 3. Let $D_n(1) := (D_1,\ldots,D_n)$ be a random vector with Dirichlet$_n(1)$ distribution. Let $B_{1,n-1} \stackrel{d}{\sim}$ beta$(1,n-1)$ be a random variable independent of $\prod_{m=1}^{n} S_m^{1/n}$. With $B_{n/m,1} \stackrel{d}{\sim}$ beta$(n/m,1)$, $m = 1,\ldots,n$, we have
$$B_{1,n-1} \prod_{m=1}^{n} S_m^{1/n} \stackrel{d}{=} \prod_{m=1}^{n} D_m^{1/n} \prod_{m=1}^{n} B_{n/m,1}. \tag{9}$$
Here $\prod_{m=1}^{n} D_m^{1/n}$ and $(B_{n/m,1},\ m = 1,\ldots,n)$ are mutually independent.

Proof. With $\alpha := \alpha_1,\ldots,\alpha_n > 0$, let $D_n(\alpha) := (D_1,\ldots,D_n)$ be a random vector with Dirichlet$_n(\alpha)$ distribution, that is, with joint moment function
$$E\left[\prod_{m=1}^{n} D_m^{q_m}\right] = \frac{\Gamma(\alpha_1+\cdots+\alpha_n)}{\Gamma(\alpha_1+q_1+\cdots+\alpha_n+q_n)} \prod_{m=1}^{n} \frac{\Gamma(\alpha_m+q_m)}{\Gamma(\alpha_m)}.$$


Putting $\alpha_m = 1$, $q_m = q/n$, $m = 1,\ldots,n$, we have
$$E\left[\left(\prod_{m=1}^{n} D_m^{1/n}\right)^q\right] = \frac{\Gamma(n)}{\Gamma(n+q)}\,\Gamma(1+q/n)^n.$$
Next,
$$\frac{\Gamma(1+q/n)^n}{\Gamma(1+q)} = \frac{\Gamma(n)\,\Gamma(1+q/n)^n}{\Gamma(n+q)} \cdot \frac{\Gamma(n+q)}{\Gamma(1+q)\,\Gamma(n)},$$
with $\Gamma(1+q)\Gamma(n)/\Gamma(n+q)$ the moment function of a beta$(1,n-1)$ random variable. Finally, $1/(1+mq/n)$ is the moment function of a beta$(n/m,1)$ random variable. From (7) the result follows.

2.2. The partition function of the random splitting model

Let $Z_n(\gamma) := \sum_{m=1}^{n+1} S_m^{\gamma}$, $\gamma > -1$, be the (random) partition function of the random splitting model $S_n$. The range of this random variable is $[(n+1)^{1-\gamma}, 1]$ if $\gamma \geq 1$, and $[1, (n+1)^{1-\gamma}]$ if $\gamma < 1$. We have

Proposition 4. If $\gamma > 0$, as $n \uparrow \infty$,
$$Z_n(\gamma) \stackrel{d}{\to} Z_\infty(\gamma) \tag{10}$$
with $Z_\infty(\gamma) \in (0,1)$ if $\gamma > 1$, and $Z_\infty(\gamma) \in (1,\infty)$ if $\gamma \in (0,1)$. The limiting random variable $Z_\infty(\gamma)$ is characterized by
$$Z_\infty(\gamma) \stackrel{d}{=} U_1^{\gamma} + \bar U_1^{\gamma} \cdot Z'_\infty(\gamma), \tag{11}$$
with $U_1$ uniform and independent of $Z'_\infty(\gamma) \stackrel{d}{=} Z_\infty(\gamma)$.

Proof. We have $Z_1(\gamma) = U_1^{\gamma} + \bar U_1^{\gamma}$ and, with $n \geq 2$,
$$Z_n(\gamma) = U_1^{\gamma} + \bar U_1^{\gamma}\left(U_2^{\gamma} + \bar U_2^{\gamma} U_3^{\gamma} + \cdots + \bar U_2^{\gamma} \cdots \bar U_n^{\gamma}\right) \stackrel{d}{=} U_1^{\gamma} + \bar U_1^{\gamma} \cdot Z'_{n-1}(\gamma),$$
where $Z'_{n-1}(\gamma) \stackrel{d}{=} Z_{n-1}(\gamma)$ is independent of $U_1$. If $\gamma > 0$, $Z_n(\gamma) \stackrel{d}{\to} Z_\infty(\gamma)$ where $Z_\infty(\gamma) \stackrel{d}{=} U_1^{\gamma} + \bar U_1^{\gamma} \cdot Z'_\infty(\gamma)$. $Z_\infty(\gamma)$ is a random variable whose integral p-moments are recursively defined by
$$E[Z_\infty(\gamma)^p] =: m_p(\gamma) = \sum_{k=0}^{p} \binom{p}{k} E\left[U_1^{\gamma(p-k)}\, \bar U_1^{\gamma k}\right] m_k(\gamma), \quad p \geq 1,$$
with $E[U_1^{\gamma(p-k)} \bar U_1^{\gamma k}] = \Gamma(\gamma(p-k)+1)\,\Gamma(\gamma k+1)/\Gamma(\gamma p+2)$.

In particular, $m_1(\gamma) = 1/\gamma$ and $m_2(\gamma) = (1/2\gamma)\left[1 + \Gamma(\gamma)^2/\Gamma(2\gamma)\right]$ are its first and second moments.
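The moment recursion is easy to implement. The following sketch (ours, not from the paper) solves it for $m_p(\gamma)$, isolating the $k = p$ term on the right-hand side, and checks the stated first and second moments at $\gamma = 2$:

```python
import math

def moments(gamma, pmax):
    """Solve the recursion for m_p(gamma) = E[Z_inf(gamma)^p], p = 0..pmax."""
    g = math.gamma
    m = [1.0]                                   # m_0 = 1
    for p in range(1, pmax + 1):
        # coef[k] = E[U^{gamma(p-k)} * (1-U)^{gamma k}]
        coef = [g(gamma*(p-k)+1) * g(gamma*k+1) / g(gamma*p+2) for k in range(p+1)]
        rhs = sum(math.comb(p, k) * coef[k] * m[k] for k in range(p))
        m.append(rhs / (1.0 - coef[p]))         # move the k = p term to the left
    return m

gamma = 2.0
m = moments(gamma, 2)
assert abs(m[1] - 1/gamma) < 1e-12                                   # m_1 = 1/gamma
assert abs(m[2] - (1/(2*gamma))*(1 + math.gamma(gamma)**2
                                 / math.gamma(2*gamma))) < 1e-12     # m_2
```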

Some direct consequences of this proposition are:

(1) At $\gamma = 1$, $m_1(1) = 1$ and $m_2(1) - m_1(1)^2 = 0$ (expressing that $Z_n(1) = 1$).

(2) If $\gamma \in (-1,0)$, $[E Z_n(\gamma)]^{1/n} \to_{n\uparrow\infty} 1/(1+\gamma)$: $Z_n(\gamma)$ does not converge.

(3) When $\gamma > 0$, the full $\lambda$-moment function $E[Z_\infty(\gamma)^\lambda]$ of $Z_\infty(\gamma)$ can be obtained by raising both sides of (11) to the power $\lambda$ and developing the right-hand side in a $\lambda$-power series, to obtain
$$E[Z_\infty(\gamma)^\lambda] = \sum_{p_1,p_2 \geq 0} (\lambda)_{p_1+p_2}\, \frac{E\left[(U_1^{\gamma}-1)^{p_1}\, \bar U_1^{\gamma p_2}\right]}{p_1!\, p_2!}\, m_{p_2}(\gamma)$$


with $(\lambda)_p = \lambda(\lambda-1)\cdots(\lambda-p+1)$, $(\lambda)_0 := 1$, and
$$E\left[(U_1^{\gamma}-1)^{p_1}\, \bar U_1^{\gamma p_2}\right] = \sum_{r=0}^{p_1} (-1)^{p_1-r} \binom{p_1}{r} E\left[U_1^{\gamma r}\, \bar U_1^{\gamma p_2}\right] = \sum_{r=0}^{p_1} (-1)^{p_1-r} \binom{p_1}{r} \frac{\Gamma(\gamma r+1)\,\Gamma(\gamma p_2+1)}{\Gamma(\gamma(r+p_2)+2)}.$$

Putting $\lambda = q/\gamma$, $q > 0$, also gives $E[Z_\infty(\gamma)^{q/\gamma}]$, which is the limiting q-moment function of the $\gamma$-norm of $S_n$: $Z_n(\gamma)^{1/\gamma} = \left[\sum_{m=1}^{n+1} S_m^{\gamma}\right]^{1/\gamma} \in [(n+1)^{(1-\gamma)/\gamma}, 1]$, assuming $\gamma \geq 1$.

(4) A related random quantity of interest is Rényi's $\gamma$-average of the fragments' sizes $S_n$, which is
$$\left[\sum_{m=1}^{n+1} S_m^{\gamma+1}\right]^{1/\gamma} =: \langle S_n\rangle_\gamma, \quad \gamma \in \mathbb{R}.$$
The range of this random variable is $[1/(n+1), 1]$ for any $\gamma \in (-1,\infty)$ and $[0, 1/(n+1)]$ for any $\gamma \in (-\infty,-1)$. Note that, with $S^\downarrow_{1,n} := S_1 \vee \cdots \vee S_{n+1}$ the largest spacing and $S^\downarrow_{n+1,n} := S_1 \wedge \cdots \wedge S_{n+1}$ the smallest, $\langle S_n\rangle_\gamma \to_{\gamma\uparrow\infty} S^\downarrow_{1,n}$ and $\langle S_n\rangle_\gamma \to_{\gamma\downarrow-\infty} S^\downarrow_{n+1,n}$, almost surely. For this random variable, we obtain

Corollary 5. If $\gamma > -1$, $\gamma \neq 0$, then $\langle S_n\rangle_\gamma \stackrel{d}{\to}_{n\uparrow\infty} \langle S_\infty\rangle_\gamma \in (0,1)$ and, with $q > 0$,
$$E\left[\langle S_\infty\rangle_\gamma^q\right] = \sum_{p_1,p_2 \geq 0} \left(\frac{q}{\gamma}\right)_{p_1+p_2} \frac{E\left[(U_1^{\gamma+1}-1)^{p_1}\, \bar U_1^{(\gamma+1)p_2}\right]}{p_1!\, p_2!}\, m_{p_2}(\gamma+1)$$
is the q-moment function of $\langle S_\infty\rangle_\gamma$.

2.3. The one-dimensional law of $S_m$; limit law for vacant space

Let us now consider the one-dimensional distributions. Clearly $S_m \stackrel{d}{=} e^{-G_m}$ with $G_m$ gamma(m)-distributed, $m = 1,\ldots,n$; i.e. the $S_m$ are log-gamma distributed, with density and distribution functions respectively given by (see Feller, 1971; Schultz and Crouse, 1973):
$$f_{S_m}(s) = \frac{1}{(m-1)!}(-\log s)^{m-1}, \tag{12}$$
$$F_{S_m}(s) = s \sum_{k=0}^{m-1} \frac{1}{k!}(-\log s)^k, \quad s \in (0,1),\ m = 1,\ldots,n. \tag{13}$$
The $S_m$ distribution on the interval (0,1) is also characterized by its moment function $E S_m^q = (1+q)^{-m}$, $q > -1$, with mean value $E S_m = 2^{-m}$, $m = 1,\ldots,n$.

Note that the free space distribution is characterized by $E S_{n+1}^q = (1+q)^{-n}$, $q > -1$, with mean value $E S_{n+1} = 2^{-n}$. From this, observing that $S_{n+1} \stackrel{d}{=} S_n$ and $S_{n+1} = 1 - \bar S_n$, where $\bar S_n := \sum_{m=1}^{n} S_m$ is the amount of space occupied by the first n species, we get, applying (a multiplicative version of) the law of large numbers and the central limit theorem:


Proposition 6. Vacant space (or the amount of space occupied by the last incoming species) satisfies
$$S_{n+1}^{1/n}\ (\text{or } S_n^{1/n}) \stackrel{a.s.}{\to} e^{-1}, \qquad \left[e\,S_{n+1}^{1/n}\right]^{\sqrt{n}}\ (\text{or } \left[e\,S_n^{1/n}\right]^{\sqrt{n}}) \stackrel{d}{\to} L. \tag{14}$$
The space occupied by the n first species satisfies
$$(1-\bar S_n)^{1/n} \stackrel{a.s.}{\to} e^{-1}, \qquad \left[e\,(1-\bar S_n)^{1/n}\right]^{\sqrt{n}} \stackrel{d}{\to} L, \tag{15}$$
where L has a lognormal distribution.
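A quick simulation sketch (ours) of the strong-law part of Proposition 6: since $S_{n+1} = \prod_{m=1}^n (1-U_m)$, we have $(1/n)\log S_{n+1} \to E\log(1-U_1) = -1$, so $S_{n+1}^{1/n} \to e^{-1}$:

```python
import math, random

# Single-run check that the vacant space satisfies S_{n+1}^{1/n} ~ e^{-1}.
rng = random.Random(3)
n = 100_000
log_vacant = sum(math.log1p(-rng.random()) for _ in range(n))  # log S_{n+1}
assert abs(math.exp(log_vacant / n) - math.exp(-1)) < 0.01
```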

2.4. Ranking: the law of the smallest and largest spacings

Consider next the sequence $(S^\downarrow_{m,n},\ m = 1,\ldots,n+1)$ obtained by ranking the spacings' vector $S_n$ in descending order, including free space, hence with $S^\downarrow_{1,n} \geq \cdots \geq S^\downarrow_{n+1,n}$. The $S^\downarrow_{m,n}$ distribution can easily be found. In particular,

Proposition 7. (i) Let $\bar F_{S^\downarrow_{n+1,n}}(s) := P(S^\downarrow_{n+1,n} > s)$. Then
$$\bar F_{S^\downarrow_{2,1}}(s) = 1 - 2s, \quad s \in (0,1/2) \tag{16}$$
and, with $n \geq 2$,
$$\bar F_{S^\downarrow_{n+1,n}}(s) = \int_s^{1-ns} \bar F_{S^\downarrow_{n,n-1}}\left(\frac{s}{1-u}\right) du, \quad s \in \left(0, \frac{1}{n+1}\right), \tag{17}$$
defines the smallest spacing's distribution with n species by recurrence.

(ii) Let $F_{S^\downarrow_{1,n}}(s) := P(S^\downarrow_{1,n} \leq s)$. Then
$$F_{S^\downarrow_{1,1}}(s) = 2s - 1, \quad s \in (1/2,1) \tag{18}$$
and, with $n \geq 2$,
$$F_{S^\downarrow_{1,n}}(s) = \int_{(1-ns)_+}^{s} F_{S^\downarrow_{1,n-1}}\left(\frac{s}{1-u}\right) du, \quad s \in \left(\frac{1}{n+1}, 1\right), \tag{19}$$
defines the largest spacing's distribution with n species by recurrence.

Proof. We prove (ii); a similar proof can be given for the smallest piece. Clearly, when n = 1,
$$F_{S^\downarrow_{1,1}}(s) = 2s - 1, \quad s \in (1/2,1).$$
With $n \geq 2$, the result follows from the observation
$$F_{S^\downarrow_{1,n}}(s) = P(S_1 \leq s, \ldots, S_{n+1} \leq s) = P\left(U_1 \leq s,\ \bar U_1 U_2 \leq s,\ \ldots,\ \prod_{m=1}^{n} \bar U_m \leq s\right)$$
$$= \int_{(1-ns)\vee 0}^{s} P\left(U_2 \leq \frac{s}{1-u},\ \ldots,\ \prod_{m=2}^{n} \bar U_m \leq \frac{s}{1-u}\right) du = \int_{(1-ns)_+}^{s} F_{S^\downarrow_{1,n-1}}\left(\frac{s}{1-u}\right) du,$$
where the lower integration bound is $(1-ns)\vee 0 =: (1-ns)_+$ because the largest-piece probability density with $n-1$ species and n intervals has support $(1/n, 1)$. At the first step, we get, for $s \in (\tfrac13, 1)$,
$$F_{S^\downarrow_{1,2}}(s) = \begin{cases} 2s\log\dfrac{2s}{1-s} - (3s-1), & s \in \left(\tfrac13, \tfrac12\right], \\[4pt] -2s\log s + 3s - 2, & s \in \left[\tfrac12, 1\right). \end{cases}$$


This distribution function and its density are continuous, but the density has a discontinuous derivative at $s = \tfrac12$.
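The first iterate can be confirmed by direct simulation (our check, not in the paper): with n = 2 the three pieces are $U_1$, $(1-U_1)U_2$ and $(1-U_1)(1-U_2)$, and for $s \in (1/2,1)$ our evaluation of recurrence (19) gives $P(\text{largest piece} \leq s) = -2s\log s + 3s - 2$:

```python
import math, random

# Monte Carlo check of F at s = 0.6: expected -2*0.6*log(0.6) + 1.8 - 2 ~ 0.413.
rng = random.Random(13)
s, trials, hits = 0.6, 200_000, 0
for _ in range(trials):
    u1, u2 = rng.random(), rng.random()
    largest = max(u1, (1 - u1) * u2, (1 - u1) * (1 - u2))
    hits += largest <= s
assert abs(hits / trials - (-2 * s * math.log(s) + 3 * s - 2)) < 5e-3
```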

Proposition 8. As $n \uparrow \infty$,
$$S^\downarrow_{1,n} \stackrel{d}{\to} \xi^\downarrow_1,$$
where $1/\xi^\downarrow_1$ has Dickman's distribution.

Proof. Assume $S^\downarrow_{1,n} \stackrel{d}{\to} S_\infty$ as $n \uparrow \infty$ for some non-degenerate limiting random variable $S_\infty$. Then, from (19), there is a non-degenerate distribution function $F(s) := F_{S_\infty}(s)$ solving the functional equation
$$F(s) = \int_0^s F\left(\frac{s}{1-u}\right) du, \quad s \in (0,1),$$
translating the distributional equality $S_\infty \stackrel{d}{=} U_1 \vee \bar U_1 \cdot S'_\infty$, where $S'_\infty \stackrel{d}{=} S_\infty$ is independent of $U_1$. Writing $s/(1-u) = 1/t$, this is also
$$F(s) = s \int_{(1-s)/s}^{1/s} F(1/t)\, dt.$$
If $\bar G(x) = F(1/x) = P(1/S_\infty \geq x)$, $x > 1$, we obtain
$$x\,\bar G(x) = \int_{x-1}^{x} \bar G(z)\, dz \quad \text{or} \quad x\,\bar G'(x) + \bar G(x-1) = 0,$$
showing that $\bar G(x)$ is Dickman's function $\rho(x)$ (see Holst, 2001 and the references therein). As a result,
$$F(s) = \rho(1/s), \quad s \in (0,1), \tag{20}$$
which for $s \in (\tfrac12, 1)$ reads $F(s) = 1 + \log s$, is the sought distribution of $S_\infty \stackrel{d}{=} \xi^\downarrow_1$.
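Numerically (our sketch, not part of the paper), the Dickman limit can be seen at $s = 1/2$: at most one fragment can exceed 1/2, and for GEM(1) this happens with probability $\log 2$, so $P(\text{largest} \leq 1/2) = \rho(2) = 1 - \log 2 \simeq 0.307$. Truncating the stick-breaking at 50 terms leaves only an $e^{-50}$-sized remainder:

```python
import math, random

# Monte Carlo estimate of P(largest GEM(1) fragment <= 1/2) = 1 - log 2.
rng = random.Random(5)
trials, n = 50_000, 50
hits = 0
for _ in range(trials):
    remaining, largest = 1.0, 0.0
    for _ in range(n):
        u = rng.random()
        largest = max(largest, remaining * u)  # track the largest spacing
        remaining *= 1.0 - u
    hits += largest <= 0.5
assert abs(hits / trials - (1 - math.log(2))) < 0.01
```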

Concerning the minimum, we find

Proposition 9. As $n \uparrow \infty$, almost surely,
$$\liminf e^n S^\downarrow_{n+1,n} = 0 \quad \text{and} \quad \limsup e^n S^\downarrow_{n+1,n} = +\infty.$$

Proof. In view of Kingman's result stating that in a PD(1) model $e^m \xi^\downarrow_m \stackrel{a.s.}{\to} 1$ as $m \uparrow \infty$, one could speculate on the existence of a non-degenerate random variable $C > 0$ such that
$$e^n S^\downarrow_{n+1,n} \stackrel{d}{\to} C > 0.$$

Let $\bar F(x)$, $x \geq 0$, be the complementary distribution function of C, if it exists. From (17), it should satisfy
$$\bar F(x) = \int_0^1 \bar F\left(\frac{x}{e(1-u)}\right) du,$$


meaning $C \stackrel{d}{=} e\bar U \cdot C'$, where $\bar U$ is uniform and independent of $C' \stackrel{d}{=} C$. After a change of variable in the integral, this is
$$\bar F(x) = \frac{x}{e} \int_0^{e/x} \bar F\left(\frac{1}{t}\right) dt \quad \text{or} \quad x\,\bar F'(x) = \bar F(x) - \bar F(x/e).$$
If $\bar G(z) = \bar F(e^z)$, $z \in \mathbb{R}$, then the complementary distribution function $\bar G$ should solve the differential–difference equation
$$\bar G'(z) = \bar G(z) - \bar G(z-1), \quad \text{with } \bar G(-\infty) = 1,\ \bar G(\infty) = 0.$$
Clearly, under our assumptions, the only solutions to this equation are the constants $\bar G(z) = 0$ or 1. So the only solutions to the original functional equation under study are $\bar F(x) = 0$, $x \geq 0$ ($C = 0$), and $\bar F(x) = 1$, $0 \leq x < \infty$ ($C = \infty$).

A related problem is the following: what are the largest and smallest fragments' distributions if $S_{n+1}$ is not taken into account? Defining $S^+_{1,n} := S_1 \vee \cdots \vee S_n$, we obtain the simpler recurrence
$$F_{S^+_{1,1}}(s) = P(U_1 \leq s) = s, \quad s \in (0,1)$$

and, for $n \geq 2$,
$$F_{S^+_{1,n}}(s) = P(S_1 \leq s,\ S_2 \leq s,\ \ldots,\ S_n \leq s) = P\left(U_1 \leq s,\ \bar U_1 U_2 \leq s,\ \ldots,\ \prod_{m=1}^{n-1} \bar U_m U_n \leq s\right)$$
$$= \int_0^s P\left(U_2 \leq \frac{s}{1-u},\ \ldots,\ \prod_{m=2}^{n-1} \bar U_m U_n \leq \frac{s}{1-u}\right) du = \int_0^s F_{S^+_{1,n-1}}\left(\frac{s}{1-u}\right) du,$$
leading at the first step to
$$F_{S^+_{1,2}}(s) = P(U_1 \leq s,\ \bar U_1 U_2 \leq s) = \int_0^s P\left(U_2 \leq \frac{s}{1-u}\right) du = -s\log\left[1 - s\wedge(1-s)\right] + s - s\wedge(1-s).$$

With $s \in (0,1)$, this distribution function and its density are continuous, but the density has a discontinuous derivative at $s = \tfrac12$. Clearly, as $n \uparrow \infty$, $S^+_{1,n} \stackrel{d}{\to} \xi^\downarrow_1$, where $1/\xi^\downarrow_1$ again has Dickman's distribution. Similarly for the minimum: defining $S^+_{n,n} := S_1 \wedge \cdots \wedge S_n$, we get

$$\bar F_{S^+_{1,1}}(s) = P(U_1 > s) = 1 - s, \quad s \in (0,1)$$
and, for $n \geq 2$,
$$\bar F_{S^+_{n,n}}(s) = \int_s^1 \bar F_{S^+_{n-1,n-1}}\left(\frac{s}{1-u}\right) du$$

is the probability that each of the n species occupies at least a fraction s of space (a possible survival condition of the colony, for some suitably chosen s).


2.5. The joint law of smallest and largest spacings

Similarly, the joint law of the smallest and largest spacings is defined as follows. Define
$$G_n(s_1, s_2) := P(S^\downarrow_{n+1,n} > s_1,\ S^\downarrow_{1,n} \leq s_2). \tag{21}$$
Then
$$G_1(s_1, s_2) = s_2\wedge(1-s_1) - s_1\vee(1-s_2), \quad s_1 < \tfrac12 \leq s_2,$$
and, when $n \geq 2$,
$$G_n(s_1, s_2) = \int_{s_1\vee(1-ns_2)}^{s_2\wedge(1-ns_1)} G_{n-1}\left(\frac{s_1}{1-u}, \frac{s_2}{1-u}\right) du, \quad s_1 < \frac{1}{n+1} \leq s_2.$$

Similarly, if now
$$G^+_n(s_1, s_2) := P(S^+_{n,n} > s_1,\ S^+_{1,n} \leq s_2), \tag{22}$$
then
$$G^+_1(s_1, s_2) = s_2 - s_1, \quad s_1 < s_2.$$
When $n \geq 2$,
$$G^+_n(s_1, s_2) = \int_{s_1}^{s_2} G^+_{n-1}\left(\frac{s_1}{1-u}, \frac{s_2}{1-u}\right) du, \quad s_1 < s_2.$$
If $s_1 = s$, $s_2 = 2s$, then $G^+_n(s, 2s)$ is the probability that each of the n species occupies at least a fraction s and at most a fraction 2s of space, so that the gaps $S_m - s$, $m = 1,\ldots,n$, are not larger than s for such occupation configurations.

3. Size-biased sampling

Recall $S_n := (S_1,\ldots,S_{n+1})$ with $\sum_{m=1}^{n+1} S_m = 1$ and $S_n \stackrel{d}{=} S_{n+1}$. Let V be a uniformly distributed random throw on [0,1] and $L_{1,n} := L_{1,n}(V)$ the length of the interval of the random partition $S_n$ containing the sample V. The distribution of $L_{1,n}$ is characterized by the size-biased conditional probability
$$P(L_{1,n} = S_m \mid S_n) = S_m,$$
where larger intervals are favored.

Next, the distribution of the size-biased permutation of the fragments' lengths is considered. Indeed, to avoid revisiting many times the first species encountered, we must remove it from the population once it has first been met in the sampling process, which requires an estimation of its length for normalization purposes. Upon iterating this process, we are led to the size-biased permutation of the species' weights as the sequence of proportions of the species, in their order of appearance, in a process of random sampling from the population.


3.1. The length of the first size-biased pick

From this size-biased picking construction, it follows that, for all non-negative measurable functions f on [0,1],
$$E[f(L_{1,n})/L_{1,n}] = E\big[E[f(L_{1,n})/L_{1,n} \mid S_n]\big] = E\left[\sum_{m=1}^{n+1} \frac{f(S_m)}{S_m}\, P(L_{1,n} = S_m \mid S_n)\right] = \sum_{m=1}^{n+1} E[f(S_m)]. \tag{23}$$

Taking in particular $f(x) = x\,I(x > s)$ in (23), we get
$$\bar F_{L_{1,n}}(s) := P(L_{1,n} > s) = \sum_{m=1}^{n+1} E[S_m I(S_m > s)],$$
which is
$$\bar F_{L_{1,n}}(s) = \sum_{m=1}^{n+1} \int_s^1 t\, f_{S_m}(t)\, dt = \sum_{m=0}^{n-1} \int_s^1 \frac{t(-\log t)^m}{m!}\, dt + \int_s^1 \frac{t(-\log t)^{n-1}}{(n-1)!}\, dt \tag{24}$$

from (12), recalling that $f_{S_{n+1}} = f_{S_n}$.

Taking $f(x) = x^{q+1}$ in (23), the moment function of $L_{1,n}$ reads ($q > -2$)
$$E[L_{1,n}^q] = \sum_{m=1}^{n+1} E[S_m^{q+1}] = \sum_{m=1}^{n} (2+q)^{-m} + (2+q)^{-n} = \frac{1 + q(2+q)^{-n}}{1+q},$$
recalling that $E[S_m^q] = (1+q)^{-m}$ is the moment function of $S_m$, $m = 1,\ldots,n$, and that $S_n \stackrel{d}{=} S_{n+1}$. We note that $E[L_{1,n}] = E[\langle S_n\rangle_1] = \frac12(1 + 3^{-n}) > E(S_1) = \frac12$, and so one expects $L_{1,n}$ to be stochastically larger than $S_1$ (which itself is stochastically the largest among the $S_m$). Indeed, we obtain:

Proposition 10.
$$L_{1,n} \stackrel{d}{\succ} S_1. \tag{25}$$

Proof. Differentiating (24) twice with respect to s, we get
$$\bar F''_{L_{1,n}}(s) = -\frac{2}{(n-1)!}(-\log s)^{n-1}\left[1 + \frac{n-1}{2\log s}\right].$$
So $\bar F_{L_{1,n}}(s)$ has a single inflection point at $s = e^{-(n-1)/2}$ ($L_{1,n}$ is unimodal). As $\bar F'_{L_{1,n}}(0) = 0$ and $\bar F'_{L_{1,n}}(1) = -1$, we have $\bar F_{L_{1,n}}(s) \geq \bar F_{S_1}(s) = 1 - s$ for all $s \in [0,1]$.

Finally, we note that $E[L_{1,n}^q] \to 1/(1+q) = E[S_1^q]$: as $n \uparrow \infty$, $L_{1,n} \stackrel{d}{\to} S_1$, uniform.
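A simulation sketch (ours, not from the paper) of the first size-biased pick: throw a uniform V on [0,1], record the length of the spacing containing it, and compare the mean with $E[L_{1,n}] = \frac12(1 + 3^{-n})$:

```python
import random

rng = random.Random(9)
trials, n = 200_000, 4
total = 0.0
for _ in range(trials):
    remaining, v, length = 1.0, rng.random(), None
    for _ in range(n):
        s = remaining * rng.random()        # S_m, laid out consecutively
        if v < s:
            length = s                      # V fell inside this spacing
            break
        v -= s
        remaining -= s
    if length is None:
        length = remaining                  # V fell in the vacant part S_{n+1}
    total += length
assert abs(total / trials - 0.5 * (1 + 3.0 ** -n)) < 5e-3   # E[L_{1,n}]
```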


3.2. Size-biased permutation of the fragments

Consider the random partition $S_n$. Let $L_{1,n}$ be the first size-biased pick (SBP) just discussed, attached to the first randomly chosen fragment $M_1 := M$, so that $L_{1,n} := S_{M_1}$. A standard problem is to iterate the size-biased picking procedure, avoiding the fragments already encountered: by doing so, a size-biased permutation of the fragments is obtained. In the first step of this size-biased picking procedure,
$$S_n := S^{(0)}_n \to (L_{1,n}, S_1, \ldots, S_{M_1-1}, S_{M_1+1}, \ldots, S_{n+1}),$$
which may be written as $S_n \to (L_{1,n}, (1-L_{1,n})S^{(1)}_{n-1})$, with
$$S^{(1)}_{n-1} := (S^{(1)}_1, \ldots, S^{(1)}_{M_1-1}, S^{(1)}_{M_1+1}, \ldots, S^{(1)}_{n+1})$$
a new random partition of the unit interval into n random fragments. Next, pick at random an interval in $S^{(1)}_{n-1}$, call $\mathcal{L}_{2,n}$ its length, and iterate until all fragments have been exhausted. With $\mathcal{L}_{1,n} := L_{1,n}$, the length of the second SBP, avoiding the first, reads
$$L_{2,n} = (1 - \mathcal{L}_{1,n})\,\mathcal{L}_{2,n},$$
where $\mathcal{L}_{1,n}$ and $\mathcal{L}_{2,n}$ are not independent.

where L1; n and L2; n are not independent. Iterating, the .nal SBP vector is Ln := (L1; n; : : : ; Ln+1; n).If (L1; n; : : : ;Ln;n) is the random pick vector at each step, then

Lm;n =m−1∏k=1

?Lk;nLm;n; m = 1; : : : ; n; (26)

Ln+1; n = 1 −n∑

m=1

Lm;n =n∏

m=1

?Lk;n

is the stick-breaking scheme construction of the size-biased sampling vector (where we have put?Lk;n := 1−Lk;n). From this well-known construction, we can compute in theory the joint distribution

of Ln. Let (m1; : : : ; mn; mn+1) be any permutation of {1; : : : ; n+1}. Looking at the fragments numbersand lengths in a full SBP procedure, we have indeed

P(M1 =m1; : : : ; Mn+1 = mn+1 |Sn)

=P(L1; n = Sm1 ; : : : ; Ln+1; n = Smn+1 |Sn) =n∏

k=1

Smk

1 −∑kl=1 Sml

Smn+1 : (27)
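For a fixed (deterministic, hypothetical) partition s, formula (27) can be checked by brute force (our sketch): the permutation probabilities sum to 1, and marginalizing the first coordinate recovers the size-biased first pick $P(M_1 = m) = s_m$:

```python
import itertools

s = [0.5, 0.3, 0.15, 0.05]             # hypothetical partition, sums to 1
n1 = len(s)
first_pick = [0.0] * n1
total = 0.0
for perm in itertools.permutations(range(n1)):
    p, cum = 1.0, 0.0
    for k in perm[:-1]:                # product over k = 1..n in (27)
        cum += s[k]
        p *= s[k] / (1.0 - cum)
    p *= s[perm[-1]]                   # trailing factor S_{m_{n+1}}
    total += p
    first_pick[perm[0]] += p
assert abs(total - 1.0) < 1e-9                                   # (27) is a law
assert all(abs(first_pick[m] - s[m]) < 1e-9 for m in range(n1))  # P(M_1=m)=s_m
```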

Consider now the joint moment function of the random size-biased pick vector $L_n := (L_{1,n}, \ldots, L_{n+1,n})$. From (27), averaging over $S_n$ and summing over all permutations, we have
$$E\left[\prod_{m=1}^{n+1} L_{m,n}^{q_m}\right] = \sum_{m_1 \neq \cdots \neq m_{n+1}} E\left\{\prod_{k=1}^{n} \frac{S_{m_k}^{q_k+1}}{1 - \sum_{l=1}^{k} S_{m_l}}\; S_{m_{n+1}}^{q_{n+1}+1}\right\}. \tag{28}$$
Although computable in theory, some combinatorics is involved in the evaluation of the right-hand-side integral.


Recalling from (26) that (with $\bar q_m := q_{m+1} + \cdots + q_{n+1}$)
$$E\left[\prod_{m=1}^{n+1} L_{m,n}^{q_m}\right] = E\left[\prod_{m=1}^{n+1} \prod_{k=1}^{m-1} \bar{\mathcal{L}}_{k,n}^{q_m}\, \mathcal{L}_{m,n}^{q_m}\right] = E\left[\prod_{m=1}^{n+1} \mathcal{L}_{m}^{q_m}\, \bar{\mathcal{L}}_{m}^{\bar q_m}\right],$$
(28) also gives the joint distribution of $(\mathcal{L}_m,\ m = 1,\ldots,n+1)$.

Next, note from (27) that, for $p \in \{1,\ldots,n+1\}$ and any sequence $(m_1,\ldots,m_{p+1})$ of distinct integers in $\{1,\ldots,n+1\}$, with $M_1,\ldots,M_p$ the numbers of the first p distinct fragments visited,
$$P(M_{p+1} = m_{p+1} \mid S_n, M_1 = m_1, \ldots, M_p = m_p) = \frac{S_{m_{p+1}}}{1 - \sum_{l=1}^{p} S_{m_l}} = P(L_{p+1,n} = S_{m_{p+1}} \mid S_n, M_1 = m_1, \ldots, M_p = m_p).$$

As a result,
$$E[L_{p+1,n}^q \mid M_1 = m_1, \ldots, M_p = m_p] = E\big[E(L_{p+1,n}^q \mid S_n, M_1 = m_1, \ldots, M_p = m_p)\big] = E\left[\frac{1}{1 - \sum_{l=1}^{p} S_{m_l}} \sum_{\substack{m_{p+1}=1 \\ m_{p+1} \neq m_1,\ldots,m_p}}^{n+1} S_{m_{p+1}}^{q+1}\right]$$
is the simpler recursive moment function of $L_{p+1,n}$ given $M_1 = m_1, \ldots, M_p = m_p$.

Acknowledgements

T. Huillet thanks the Centro de Modelamiento Matemático (CMM) for its warm hospitality during a visit. J. Barrera acknowledges support from the Millennium Nucleus in Information and Randomness, Programa Científico Milenio P01-005, Ecos/Conicyt C99E09 (French–Chilean Cooperation), Proyecto MECESUP UCH0009 and a CONICYT Ph.D. fellowship.

References

Donnelly, P., 1986. Partition structures, Pólya urns, the Ewens sampling formula and the age of alleles. Theoret. Popul. Biol. 30, 271–288.
Feller, W., 1971. An Introduction to Probability Theory and its Applications, Vol. 2. Wiley, New York.
Holst, L., 2001. The Poisson–Dirichlet distribution and its relatives revisited. Preprint of the Royal Institute of Technology, Stockholm, Sweden.
Johnson, N.L., Kotz, S., 1995. Use of moments in studies of limit distributions arising from iterated random subdivisions of an interval. Statist. Probab. Lett. 24 (2), 111–119.
Kingman, J.F.C., 1993. Poisson Processes. Oxford Studies in Probability. Clarendon Press, Oxford.
Patil, G.P., Taillie, C., 1977. Diversity as a concept and its implications for random environments. Bull. Internat. Statist. Inst. 4, 497–515.
Pitman, J., 1996. Random discrete distributions invariant under size-biased permutation. Adv. Appl. Probab. 28, 525–539.
Pitman, J., 2002. Poisson–Dirichlet and GEM invariant distributions for split-and-merge transformation of an interval partition. Combin. Probab. Comput. 11 (5), 501–514.
Pitman, J., Yor, M., 1997. The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. Ann. Probab. 25, 855–900.
Schultz, D.M., Crouse, C.F., 1973. Random splittings: a model for a mass-size distribution. South African Statist. J. 7, 143–152.
Sibuya, M., Yamato, H., 1995. Ordered and unordered random partitions of an integer and the GEM distribution. Statist. Probab. Lett. 25 (2), 177–183.
Tavaré, S., Ewens, W.J., 1997. Multivariate Ewens distribution. In: Johnson, N.L., Kotz, S., Balakrishnan, N. (Eds.), Discrete Multivariate Distributions. Wiley, New York, pp. 232–246 (Chapter 41).
Yamato, H., Sibuya, M., Nomachi, T., 2001. Ordered sample from two-parameter GEM distribution. Statist. Probab. Lett. 55 (1), 19–27.