4
Universit´ e Paris-Dauphine Ann´ ee 2012-2013 epartement de Math´ ematique Examen NOISE, sujet A Pr´ eliminaires Cet examen est ` a r´ ealiser sur ordinateur en utilisant le langage R et ` a rendre simultan´ ement sur papier pour les r´ eponses d´ etaill´ ees et sur fichier informatique Examen pour les fonctions R utilis´ ees. Les fichiers informa- tiques seront ` a sauvegarder suivant la proc´ edure ci-dessous et seront pris en compte pour la note finale. Toute duplication de fichiers R fera l’objet d’une poursuite disciplinaire. L’absence de document enregistr´ e donnera lieu ` a une note nulle sans possibilit´ e de contestation. 1. Enregistrez r´ eguli` erement vos fichiers sur l’ordinateur, sans utiliser d’accents ni d’espace, ni de caract` eres sp´ eciaux. 2. Si vous utilisez Rkward, vous devez enregistrer ` a l’aide du bouton “Save script” (ou “Save script as”) et non “Save”. 3. V´ erifiez que vos fichiers ont bien ´ et´ e enregistr´ es en les rouvrant avant de vous d´ econnecter. N’h´ esitez pas ` a rouvrir votre fichier ` a l’aide d’un autre ´ editeur de texte afin de v´ erifier qu’il contient bien tout votre code R. 4. En cas de probl` eme ou d’inqui´ etude, contacter un enseignant sans vous d´ econnecter. Il nous est sinon impossible de r´ ecup´ erer les fichiers de sauvegarde automatique. Aucun document informatique n’est autoris´ e, seuls les livres de R le sont. L’utilisation de tout service de messagerie ou de mail est interdite et, en cas d’utilisation av´ er´ ee, se verra sanctionn´ ee. Les probl` emes sont ind´ ependants, peuvent ˆ etre trait´ es dans n’importe quel ordre. R´ esoudre trois et uniquement trois exercices au choix. Exercice 1 Download the dataset LakeHuron : > data(LakeHuron) > huron = jitter(as.vector(LakeHuron)) We assume that those observations are iid realisations X n =(X 1 ,...,X n ) of a random variable X . We denote by IQ 0.5 (X n ) the inter-quartile interval of the sample X n . It is defined as IQ 0.5 (X n )= Q 0.75 (X n ) - Q 0.25 (X n ) where Q 0.75 (X n ) and Q 0.25 (X n ) are the empirical quartiles of the sample X n at levels 75% and 25%. We would like to calibrate IQ 0.5 (X n ) by a coefficient α so that it becomes an unbiased estimator of the standard deviation σ of the distribution of the X i ’s.

R exam (A) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013

Embed Size (px)

DESCRIPTION

This is one of two exams given to our students this year. They had two hours to solve three problems and had to return R codes as well as handwritten explanations.

Citation preview

Page 1: R exam (A) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013

Universite Paris-Dauphine Annee 2012-2013Departement de Mathematique

Examen NOISE, sujet A

Preliminaires

Cet examen est a realiser sur ordinateur en utilisant le langage R et arendre simultanement sur papier pour les reponses detaillees et sur fichierinformatique Examen pour les fonctions R utilisees. Les fichiers informa-tiques seront a sauvegarder suivant la procedure ci-dessous et seront prisen compte pour la note finale. Toute duplication de fichiers R fera l’objetd’une poursuite disciplinaire. L’absence de document enregistre donneralieu a une note nulle sans possibilite de contestation.

1. Enregistrez regulierement vos fichiers sur l’ordinateur, sans utiliserd’accents ni d’espace, ni de caracteres speciaux.

2. Si vous utilisez Rkward, vous devez enregistrer a l’aide du bouton“Save script” (ou “Save script as”) et non “Save”.

3. Verifiez que vos fichiers ont bien ete enregistres en les rouvrant avantde vous deconnecter. N’hesitez pas a rouvrir votre fichier a l’aide d’unautre editeur de texte afin de verifier qu’il contient bien tout votrecode R.

4. En cas de probleme ou d’inquietude, contacter un enseignant sansvous deconnecter. Il nous est sinon impossible de recuperer les fichiersde sauvegarde automatique.

Aucun document informatique n’est autorise, seuls les livres de R le sont.L’utilisation de tout service de messagerie ou de mail est interdite et, encas d’utilisation averee, se verra sanctionnee.

Les problemes sont independants, peuvent etre traites dans n’importe quelordre. Resoudre trois et uniquement trois exercices au choix.

Exercice 1Download the dataset LakeHuron :

> data(LakeHuron)

> huron = jitter(as.vector(LakeHuron))

We assume that those observations are iid realisations Xn = (X1, . . . , Xn) of a randomvariable X.

We denote by IQ0.5(Xn) the inter-quartile interval of the sample Xn. It is defined asIQ0.5(Xn) = Q0.75(Xn) − Q0.25(Xn) where Q0.75(Xn) and Q0.25(Xn) are the empiricalquartiles of the sample Xn at levels 75% and 25%. We would like to calibrate IQ0.5(Xn)by a coefficient α so that it becomes an unbiased estimator of the standard deviation σof the distribution of the Xi’s.

Page 2: R exam (A) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013

1. Write an R function iqar(x) which produces the statistic IQ0.5(Xn) associatedwith the sample x, taking special care of the case when x has 3 elements or less.Compare your output with the one of the resident R function IQR() on huron.

2. Simulate 104 replicas of a normal N (µ, σ2) sample Xn of size n = 10 and deduce aMonte Carlo evaluation of the coefficient α such that αE[IQ0.5(Xn)] = σ. (Extra-credit : Explain why the values of µ and σ can be chosen arbitrarily.)

3. Repeat the above question with 104 replicas of a normal N (µ, σ) sample Xn of sizen = 50. (Extra-credit : Do you notice enough similarity between both α’s to acceptthe hypothesis that they are equal ?)

4. Getting back to the case of question 2., when n = 10, and using the 104 reali-sations of IQ0.5(Xn) generated in question 2., deduce a 96% confidence intervalon IQ0.5(Xn)/σ. (Hint : Use the empirical cdf of the IQ0.5(Xn)’s, rather thanbootstrap.) Compare with the asymptotically normal 96% confidence interval onE[IQ0.5(Xn)]/σ. Check whether or not 1.3490 belongs to these intervals. (Extra-credit : Justify the choice α = 1/1.3490.)

5. Check whether or not huron is distributed from a normal sample (with unknownmean and variance).

6. Since huron is not necessarily a normal sample, denoting by σ the standard deviationof the distribution of the Xi’s, construct by bootstrap a 96% confidence intervalon E[IQ0.5(Xn)]/σ, where σ is estimated by the usual empirical standard deviateσ(Xn). Does it still contain 1.3490 ?

Exercice 2Consider the Rider density function

fk(x) =n!

(k!)2

(1

4− 1

π2arctan2 x

)k 1

π(1 + x2),

where n = 2k + 1 and k ≥ 1 is an integer.

1. Check by numerical integration that fk is a proper density for k = 5, 10, 20

2. Design an accept-reject algorithm function on R that produce an iid sample ofarbitrary size m for an arbitrary parameter k. Produce a graphical verification ofthe fit for m = 103 and k = 5, 10, 20.

3. We want to check from the acceptance rate of this accept-reject algorithm thatthe normalisation is correct in the above. Produce 520 realisations of an empiricalacceptance rate based on 100 proposals and deduce a 94% confidence interval onthe expectation of the acceptance rate. Check whether or not it contains the inversenormalising constant.

4. This density is actually the distribution of the median of a Cauchy sample of sizen = 2k+1. Generate a sample from the above accept-reject algorithm with m = 520and k = 10, then another sample of m = 520 medians from samples of 21 Cauchyvariates. Test whether they have the same distribution.

5. Check whether or not the p-value of the above test is distributed as a uniform U(0, 1)random variate. (Extra-credit : Establish why the distribution of the p-value shouldbe uniform.)

Page 3: R exam (A) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013

Exercice 3If U1, U2, . . . , Uk is a sample from the U(0, 1) distribution, then Mk = min(U1, . . . , Uk)follows the Beta(1, k) distribution. We wish to verify that

kMkL−−−→

k→∞Exp(1)

1. Create a function rbeta2(n, k) which simulates n realizations of the Beta(1, k)distribution, using nk realizations of the uniform distribution. (Note : if you do notmanage this question, you can use the R function rbeta(n,1,k) for the remainderof the exercise.)

2. For k = 50 and n = 1000, propose a graphical way to verify the fit of kMk to theExp(1) distribution.

3. Using ks.test() and n = 1000, check whether the exponential distribution is anacceptable fit when k = 10, k = 50, k = 200.

4. From now on, k = 200 and n = 1000. We now have a test to check the fit of a samplex to the Beta(1, k) distribution : we accept the null hypothesis that x comes fromthe Beta(1, k) distribution iff the Kolmogorov-Smirnov test accepts the hypothesisthat kx fits the Exp(1) distribution. Perform a bootstrap experiment to calculatethe probability of accepting the null hypothesis for a sample which comes from theBeta(1, k) distribution.

5. Perform another bootstrap experiment to calculate the same probability when usingdirectly the Kolmogorov-Smirnov test for fit to the Beta(1, k) distribution (whosecdf exists in R as pbeta).

Exercice 4The SkewLogistic(α) distribution defines a random variable X which takes values in Rand with cumulative distribution function

F (x) =1

(1 + e−x)α

1. Using the generic inversion method, write a function rskewlogistic(n,α) whichoutputs n realizations of the SkewLogistic(α) distribution.

2. For α = 2, give a Monte Carlo experiment to estimate V ar(X) and the median ofX. Calculate (on paper) the theoretical value of the median and compare it to yourestimate.

3. Propose a bootstrap experiment to evaluate the bias of your variance and medianestimators.

4. For α = 2, use the Kolmogorov-Smirnov test to verify that the variable

Y = log(1 + e−X)

follows an Exp(2) distribution.

Exercice 5Given the probability density

f(x|θ, δ) =C

θe−|x−δ|θ ,

Page 4: R exam (A) given in Paris-Dauphine, Licence Mido, Jan. 11, 2013

1. explain why an importance sampling technique, designed to approximate theconstant C, that is based on the Normal density cannot not work. Illustrate thislack of convergence with a numerical experiment using θ = 2 and δ = 4.

2. Propose a more suitable importance distribution.

We now focus on the integral

I =

∫Rxf(x|2, 4)dx

using samples of size n = 102.

3. Propose a Monte Carlo approximation of I. (Hint : Note that the integral over R istwice the integral over R+ when δ = 0 and connect f with a standard distributionon (δ,∞).)

4. Approximate I by importance sampling using the same distribution g as in question2.

5. Compute a confidence interval on I at level 95% for each of your method. Whichone of the two estimates does reach the lowest precision ?

6. Design a Monte Carlo experiment in order to check whether or not the asymptoticcoverage level of the CI holds. Repeat the experiment with samples of size n = 103.

Exercice 6Given the Galton density on R∗+,

f(x|µ, σ) =1

xσ√

2πexp{−(log(x)− µ)2/2σ2}

1. Determine which of the following distributions can be used in an A/R algorithmdesigned to sample from f(x|0, 1) :

g1(x) =k

λ(x

λ)k−1e−(xλ)

kg2(x) =

1

θk1

Γ(k)xk−1e−

xθ g3(x) = (1 + αx)−1/α−1

which are respectively a Weibull, a Gamma and a generalized Pareto distribution.Determine the appropriate upper bounds.

2. Using the inversion method write an algorithm that samples from the selected g.

3. Write an R function AR() that samples from f(x|0, 1). (Extra-credit : Optimize theparameters of the proposal density g.)

4. Based on a sample of size 104 from f(x|0, 1), estimate by Monte Carlo the meanand variance of h(X) = log(X) when X ∼ f and give a confidence interval at level95% for both quantities.

5. The distribution associated with f can be obtained by the transform exp{Z} whenZ ∼ N (µ, σ). Establish this result and test it, based on the sample used in question4.