Web Archiving (L’Archivage du Web)
Julien Masanès, Internet Memory Foundation
Collège de France, March 2012


Presentation given at Serge Abiteboul's seminar at the Collège de France on web archiving (March 2012).


Introduction

• The centrality of the web, the publishing application of the internet
• A foremost cultural artifact and a source for the history and science of the future
• What the problem of preserving it teaches us about this medium

The object

Measurement

• infinite (pages are generated on demand)
• it depends on the measuring tool (the crawler)
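As a rough illustration of the two points above, the sketch below crawls a hypothetical site whose pages are generated on demand: each `/day/N` page links to `/day/N+1`, much like an endless calendar. The site, the `links` function and the `max_pages` budget are all assumptions made for this example. The only point is that the "size" reported is whatever the crawler's budget allows.

```python
from collections import deque


def links(url: str) -> list[str]:
    """Hypothetical site: page /day/N always links to /day/N+1, created on demand."""
    n = int(url.rsplit("/", 1)[1])
    return [f"/day/{n + 1}"]


def crawl(seed: str, max_pages: int) -> int:
    """Breadth-first crawl that stops once `max_pages` distinct URLs are known."""
    seen = {seed}
    queue = deque([seed])
    while queue and len(seen) < max_pages:
        for nxt in links(queue.popleft()):
            if nxt not in seen and len(seen) < max_pages:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


if __name__ == "__main__":
    # The measured "size" of the site simply tracks the crawl budget.
    for budget in (10, 1_000):
        print(budget, crawl("/day/0", budget))
```

Because the site never runs out of links, the crawl ends only when the budget does, so two crawlers with different budgets would report two different sizes for the same site.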

Measurement

• 555 million websites (December 2011); 200 million new sites in 2011
• 152 million blogs (2010, BlogPulse)
• 250 million tweets per day on Twitter (October 2011)
• 30 billion pieces of content (links, notes, photos, etc.) shared on Facebook each month (2010)

Measurement

Source: http://www.worldwidewebsize.com/

Structured or not?

• HTML URLs parsed: 1,486,186,868
• Domains with triples: 65,408,946
• URLs with triples: 302,809,140
• Typed entities: 1,222,563,749
• Triples: 3,294,248,652

Web Data Commons, http://webdatacommons.org/
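To put the counts above in proportion, the short calculation below derives a few ratios from them. It uses only the figures listed on the slide; the variable names are illustrative, and the ratios are a reading aid rather than statistics published by Web Data Commons.

```python
# Figures as listed on the slide (attributed there to Web Data Commons).
html_urls_parsed = 1_486_186_868
urls_with_triples = 302_809_140
typed_entities = 1_222_563_749
triples = 3_294_248_652

# Share of parsed URLs that carry any structured data (triples).
share_structured = urls_with_triples / html_urls_parsed
# Average density of structured data on the URLs that carry some.
triples_per_url = triples / urls_with_triples
entities_per_url = typed_entities / urls_with_triples

print(f"URLs carrying triples: {share_structured:.1%}")
print(f"Triples per such URL: {triples_per_url:.1f}")
print(f"Typed entities per such URL: {entities_per_url:.1f}")
```

On these numbers, roughly one parsed URL in five carries structured data, which is one way to read the slide's question.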

A live publishing system

• Web Information Systems
• Controlled by the producer
• Continuous publication (including old ‘archived’ pages)
• The boundaries of the target object are fuzzy (a site?)

Preservation implies exactly the opposite

The Web as a cultural artifact

• Multimedia: convergence of all types of digital content
• Actionable hypertext
• Edited globally by hundreds of millions of people

Preservation without the traditional filtering of publishing

Cardinality

• Differs across institutions (museums, archives, libraries)
• Cardinality of incunabula
  – 20 million books
  – 30,000 editions
  – 650
• High cardinality gives two advantages for preservation: redundancy and time

The ‘paradoxical’ cardinality of the Web

• A virtually infinite number of copies
• But a very strong dependence on a single server

Capture and coherence

• a capture has an incompressible temporal extent
• which conflicts with permanent publication
• hence a risk of temporal incoherence within the archive itself
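One simplified way to make the last point concrete is to compare a crawl with an immediate recrawl of the same URLs: any page whose content changed or vanished in between may be inconsistent with the pages captured around it. The sketch below is only an illustration under that assumption. The `Capture` record, its fields and the sample data are hypothetical and do not describe any particular crawler.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Capture:
    url: str
    fetched_at: float  # seconds since the crawl started (hypothetical clock)
    body: bytes

    @property
    def digest(self) -> str:
        """Content hash used to decide whether two captures are identical."""
        return hashlib.sha256(self.body).hexdigest()


def coherence_defects(crawl, recrawl):
    """Label each URL that changed or disappeared between crawl and recrawl."""
    later = {c.url: c for c in recrawl}
    defects = {}
    for first in crawl:
        second = later.get(first.url)
        if second is None:
            defects[first.url] = "removed"
        elif second.digest != first.digest:
            defects[first.url] = "changed"
    return defects


crawl = [
    Capture("/index.html", 0.0, b"news: A"),
    Capture("/a.html", 40.0, b"story A"),
    Capture("/old.html", 90.0, b"old page"),
]
recrawl = [
    Capture("/index.html", 120.0, b"news: B"),  # front page changed meanwhile
    Capture("/a.html", 125.0, b"story A"),
]

# Temporal extent of the first crawl: the window in which the site could change.
duration = max(c.fetched_at for c in crawl) - min(c.fetched_at for c in crawl)
print(duration, coherence_defects(crawl, recrawl))
```

The longer the crawl takes, the wider the window in which the site can change under the crawler, and the more such defects one should expect.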

[Figure 4 of the paper cited below: Coherence defect visualization of a single crawl-recrawl pair of mpi-inf.mpg.de by visone. Node shape encodes MIME type (html; image, video, audio; dns; javascript, flash, css, rdf; pdf, zip, ps and other binary data without multimedia). Node color encodes coherence status (coherent; content incoherent, text only; link structure incoherent; content completely removed).]

M. Spaniol, A. Mazeika, D. Denev and G. Weikum: "Catch me if you can": Visual Analysis of Coherence Defects in Web Archiving. Proceedings of the 9th International Web Archiving Workshop (IWAW 2009), in conjunction with ECDL 2009.

The archive

A memory of the web

• Automatic sampling that is reasoned and documented
• Capturing a state
• Building relevant time series
• Inclusion in the internet

An infrastructure for science

• Its role in the construction of knowledge
• What will be the equivalent of libraries and archives for the web?
• A CERN for Web Science
• Inclusion in the internet

Internet Archive: http://archive.org/
Internet Memory: http://internetmemory.org
IIPC: http://netpreserve.org/
Bibliothèque Nationale de France: http://www.bnf.fr

M. Toyoda and M. Kitsuregawa, A system for visualizing and analyzing the evolution of the web with a time series of graphs, Salzburg, Austria: ACM Press New York, NY, USA, 2005.

Figure 5: Evolution of search engines for mobile phone internet services

[...] is positioned almost at the same place over time. When $c_2$ becomes greater than 1, the strictness of synchronization is weakened. This parameter can be also modified by the user.

Synchronizing the cluster view

To keep track of clusters in the cluster view, we first determine main lines of clusters (i.e. sequences of the corresponding clusters), and synchronize their positions in each main line. Then, clusters not in main lines are arranged according to their merging and splitting behavior. For example, two clusters are attracted when they are merged at the next time.

For each cluster $C^k_t$, we define the corresponding cluster $C^k_{t+1}$ at time $t+1$ as the cluster that shares the most URLs with $C^k_t$. If there were multiple clusters that share the same number of URLs, we select a community that has the largest number of URLs. We can reversely identify the cluster at time $t$ corresponding to $C^k_{t+1}$. When this corresponding cluster is just $C^k_t$, we call the sequence $(C^k_t, C^k_{t+1})$ as the main line of $C^k_t$. The main line is recursively extended over time. On a cluster $C^k_t$ in a main line, $F_{t-1}$ and $F_{t+1}$ are exerted as same as in the difference view.

There are many clusters that are not in main lines, and are merged into or split from main lines. Such clusters are attracted to related main lines whether they are connected or not. In this way, we can show that these clusters will be merged at the next time, or have split from the previous time. For example, in Figure 3, we can see that P2P systems, such as Napster and Gnutella, are merged into a cluster at 2001, and they are located near to the cluster at 2000.

$C^i_t$ is merged into a main line $(C^k_t, C^k_{t+1})$, when $C^i_t \neq C^k_t$ and $C^i_t \cap C^k_{t+1} \neq \emptyset$. In this case, $C^i_t$ is attracted to the main line. That is, the attractive force $F_a$ (the same force on connected nodes) is exerted on $C^i_t$ and $C^k_t$. When there are multiple main lines in which $C^i_t$ is involved, $C^i_t$ is attracted to each main lines. Similarly, $C^i_{t+1}$ is split from a main line $(C^k_t, C^k_{t+1})$, when $C^i_{t+1} \neq C^k_{t+1}$ and $C^i_{t+1} \cap C^k_t \neq \emptyset$. In this case, $F_a$ is exerted on $C^i_{t+1}$ and $C^k_{t+1}$.

Initially, nodes are randomly located in each panel, then each nodes are iteratively moved by those forces. The layout is fixed, when the total movement of nodes become less than a threshold. This iterative layout is shown by animation in WebRelievo.

The user can scroll and zoom into graphs. These kinds of changes in a graph are immediately propagated to all graphs for keeping layouts synchronized. Nodes in all graphs can be moved by dragging. When the user drags a node in a graph, the same node is moved in each graph. Layouts of all graphs are re-calculated and animated by the force-directed model simultaneously, so that the user can keep track of the evolution after those operations.

In our current implementation, WebRelievo can handle several hundred nodes in each graphs on a PC with a 2.8 GHz Pentium 4. In the cluster view, it means that we can see relationships between a few thousands of URLs. To display such large graphs, it requires a high resolution screen. We currently run our system on a high resolution wall display with 5120x2304 pixels. Note that screen snapshots in this paper are taken on a screen with 1900x1200 pixels, and sizes of fonts are bigger than usual for readability in the paper.
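The main-line rule quoted above can be sketched in a few lines. This is an illustrative reading of the excerpt, not the authors' WebRelievo implementation: clusters are modelled as plain sets of URLs, and the function names and sample data are assumptions made for the example.

```python
def corresponding(cluster, candidates):
    """Index of the candidate sharing the most URLs with `cluster`.

    Ties go to the largest candidate; returns None if nothing is shared.
    """
    best, best_key = None, (0, 0)
    for i, cand in enumerate(candidates):
        shared = len(cluster & cand)
        if shared == 0:
            continue
        key = (shared, len(cand))
        if key > best_key:
            best, best_key = i, key
    return best


def main_line_pairs(clusters_t, clusters_t1):
    """Pairs (k, j): cluster k at time t and cluster j at t+1 select each other."""
    pairs = []
    for k, cluster in enumerate(clusters_t):
        j = corresponding(cluster, clusters_t1)
        if j is not None and corresponding(clusters_t1[j], clusters_t) == k:
            pairs.append((k, j))
    return pairs


def merged_into(clusters_t, clusters_t1, pairs):
    """Pairs (i, k): cluster i at time t overlaps the t+1 end of main line k."""
    return [
        (i, k)
        for i, cluster in enumerate(clusters_t)
        for k, j in pairs
        if i != k and cluster & clusters_t1[j]
    ]


# Hypothetical clusters of URLs at two consecutive times.
clusters_t = [{"a", "b", "c"}, {"d", "e"}, {"f"}]
clusters_t1 = [{"a", "b", "f"}, {"d", "e", "g"}]

pairs = main_line_pairs(clusters_t, clusters_t1)
print(pairs, merged_into(clusters_t, clusters_t1, pairs))
```

In this sample the small cluster `{"f"}` is absorbed at the next time step, so it is reported as merging into the first main line. In the excerpt, this is the case where the attractive force is applied.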

What archival regime?

• what we keep and what we do not keep (value)?
• a right to be forgotten?
• privacy
• access (humans/machines)
• ...

Julien Masanès, Internet Memory Foundation

internetmemory.org

To the archivists of the Web