
Cultural heritage completo LTC
7-02-2008
12:31
Pagina 250
TOMMASO GIORDANO
EUROPEAN UNIVERSITY INSTITUTE, FIESOLE
Conclusions
Surprisingly, the general attitude of librarians towards preservation issues has not changed significantly in the last 15 years, despite the great change that has occurred in knowledge management and cultural communication systems. Moreover, the increased awareness of the problem has not reduced the gap between perception and practice. On the organizational level, however, the difference between the traditional approach to collection development and the emerging model is quite radical, as is the difference between the professional cultures they respectively imply. Yet we are not dealing solely with a “cultural” issue; it is also a structural matter of vast dimensions that undermines the business model which has until now supported libraries. A radical shift is in progress from an economic model based on the accumulation (and 'capitalization') of the resources acquired to a model based on renting resources for temporary use, with no heritage and no guarantees for the future. It is not a change; it is a genetic mutation in libraries, one that challenges the foundations of modern librarianship. “The library is a growing organism,” Ranganathan's 5th law declares; the sustainability of this principle is now an open question.
CATHERINE LUPOVICI
WEB ARCHIVING:
WHAT SHALL WE PRESERVE AND
HOW TO MAKE IT USABLE?
CATHERINE LUPOVICI
BIBLIOTHÈQUE NATIONALE DE FRANCE, PARIS
CATHERINE LUPOVICI
She is Head of the Digital Library Department, Direction des Services et des Réseaux, Bibliothèque Nationale de France. This department is in charge of coordinating pilot projects that build on digital library services, technologies and standards. Prior to joining BnF she was for ten years Libraries Activity Manager at Jouve SA, a French electronic publishing company offering data conversion and document scanning services. Her previous experience includes national responsibilities for university library co-operation, networking and automation in France within the Ministry of Education, DBMIST (Direction des Bibliothèques, des Musées et de l'Information Scientifique et Technique), and heading the National Academy of Medicine Library, Paris.
ABSTRACT
Web contents increasingly complement classical resources, and cultural and scientific institutions are looking at integrating Web material into their collection development policies. The number of items that the Web represents is a shift in scale that cannot be handled in the traditional way in terms of acquisition, description and access. The International Internet Preservation Consortium (IIPC) was launched in 2003 by institutions already involved in web archiving. Its objectives are to share an understanding of the specific requirements and to develop appropriate methods, tools and standards that will allow the future interoperability of repositories, in order to facilitate cross access to large-scale collections and their future use through smart analysis tools yet to be developed.
WHY ARCHIVING THE WEB
Web contents increasingly complement the classical resources traditionally acquired by cultural and scientific institutions, and it is obvious today that the Web is the place where classical documents are becoming digital.
The typology of Web contents extends from classical publications, with more and more self-publications and grey literature, to emerging types of online content that are continuously appearing, such as digital arts, e-learning and e-business, but also blogs and new public spaces dedicated to discussion and chat.
The Web also has specific technical characteristics that create new challenges for acquisition, preservation and communication. The most significant are:
› The Web represents massive dynamic contents with a lot of interlinking that adds semantic value to the pages. So the Web can only be archived on a periodical basis, and the archive must include the links as part of the content itself. In addition, the mass can only be handled by automatic processes.
› The Web is divided into two parts: the surface web or visible web, which is accessible to robots for harvesting and indexing, and the deep web, to which robots have restricted access because of passwords or technical limitations.
The Web is growing exponentially and its real size is difficult to know precisely. According to the OCLC web characterization study¹, the number of unique web sites grew from 2 636 000 in 1998 to 7 128 000 in 2000 and 8 712 000 in 2002. Of these, the public sites (sites or significant portions of sites accessible free of charge and without any restriction) grew from 1 457 000 in 1998 to 2 942 000 in 2000 and 3 080 000 in 2002.
Other studies focused on the indexable pages discovered by the search engines. A first study estimated the number of pages at 200 million in 1997², and a similar study conducted in January 2005³ estimated the number at 11.5 billion pages. If we consider the national domain of a European country like France, a snapshot of the .fr domain resulting from a one-month broad crawl carried out in fall 2004 contained 121 million files pertaining to 500 000 different hosts. The size of the snapshot is 3 TB. By comparison, the BnF receives about 60 000 monographs by legal deposit per year.
We can see that, for a national library whose memory mission is to preserve and provide access to what is made publicly available in the country, the number of items to be considered represents a tremendous change of scale when moving to the Web.
HOW NATIONAL LIBRARIES ARE ARCHIVING THE WEB
Collection policy
National libraries started web archiving in the mid 90s with different collection policy approaches.
The National Library of Canada in 1994, then the National Library of Australia in 1996, started with the deposit of only digital resources (e.g. e-journals). They processed them in the same way as classical resources in the deposit workflow. The corresponding collections were catalogued item by item.
1 O'Neill, Edward T.; Lavoie, Brian F.; Bennett, Rick. Trends in the evolution of the public web 1998-2002. In D-Lib Magazine, April 2003, vol. 9, n° 4. http://www.dlib.org/dlib/april03/lavoie/04lavoie.html
2 Bharat, K.; Broder, A. A technique for measuring the relative size and overlap of public search engines. WWW conference 1998.
3 Gulli, A.; Signorini, A. The indexable web is more than 11.5 billion pages. WWW conference 2005. http://www.cs.uiowa.edu/~asignori/web-size/
The Royal Library of Sweden started in 1997, just like Internet Archive, with periodical automatic harvesting of the national web domain, building on the model of a domain-centric policy. The collections were not catalogued but only indexed by the URLs of the files.
The Library of Congress started thematic, event-based harvesting with worldwide coverage for the presidential elections in 2000 and for 11 September 2001, building on a topic-centric policy. The collected sites were not catalogued in a classical way, but some descriptive metadata were provided at the collection level.
In July 2003 the national libraries that had already started web archiving created, in partnership with Internet Archive, the International Internet Preservation Consortium⁴, with the objectives of sharing their expertise and developing appropriate methodologies and tools for the whole processing chain, considering web archive collections at least at the domain scale.
The work done in the consortium demonstrated that complementary approaches have to be implemented to collect public web sites as well as the deep web.
› Harvesting preserves the inter-linking and navigation features of the web, and the context has to be recorded at harvest time.
› Deposit by the producer makes it possible to obtain the deep web. The relationships with the producers facilitate negotiation for the inclusion of some preservation metadata in the deposited files.
The consortium elaborated a format that can handle a large number of harvested files as well as deposited files and record preservation metadata compliant with the OAIS⁵ (Open Archival Information System) information model. The WARC⁶ format has been accepted as an ISO TC46 work item and is intended to become an ISO standard.
User access
Beyond the copyright problem, which leads to restricted public access to the results of web harvesting in some countries where legal deposit legislation has already been extended to the Web, search and navigation across huge collections of web archives is a new challenge.
The minimum requirements identified by the IIPC members for access tools to collections at the domain level are:
› Indexing and search by URI and by date of harvest
› Full-text search with search-engine-like functionality
› Navigation through the archive by URLs and over time
These minimum requirements are already offered by IIPC members, and tools in the public domain are available for download on the consortium web site⁷.
The following example shows the interface developed by the Nordic Web Archives project. The demo available online⁸ is built with the IIPC tools applied to several snapshots of the institutional sites of IIPC partners. It applies the principles defined by the consortium members for the minimum requirements.
› Search by URI
› Full text search
› Time line with the harvest times available for the selected URL. The resolution of the time line can be set to the following values: minutes, hours, days, months, years, which correspond to the possible frequency parameters used for harvesting.
In addition to those simple features, which are already available whatever the size of the collections, the IIPC members recognized the need for smarter tools enabling automatic classification and semantic organization of the collections.
The next programmes of the IIPC will concentrate on such tools.
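The minimum access requirements described in this section (search by URI and by date of harvest, and a time line with a selectable resolution) can be illustrated with a small sketch. It is only a toy in-memory model written for this purpose, not the IIPC or WERA implementation: the class name `HarvestIndex`, its methods, the `RESOLUTIONS` table and the example URL and dates are all assumptions introduced here.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical mapping: strftime patterns that truncate a harvest time
# to each of the time-line resolutions named in the text.
RESOLUTIONS = {
    "years": "%Y",
    "months": "%Y-%m",
    "days": "%Y-%m-%d",
    "hours": "%Y-%m-%d %H",
    "minutes": "%Y-%m-%d %H:%M",
}


class HarvestIndex:
    """Toy index of harvested files, keyed by URI and date of harvest."""

    def __init__(self):
        # URI -> list of harvest datetimes (one entry per capture)
        self._captures = defaultdict(list)

    def add(self, uri, harvested_at):
        """Record that `uri` was harvested at `harvested_at`."""
        self._captures[uri].append(harvested_at)

    def search(self, uri, start=None, end=None):
        """Return the sorted harvest times of `uri`, optionally within a date range."""
        times = sorted(self._captures.get(uri, []))
        return [
            t for t in times
            if (start is None or t >= start) and (end is None or t <= end)
        ]

    def timeline(self, uri, resolution="days"):
        """Count the captures of `uri` per time-line slot at the given resolution."""
        pattern = RESOLUTIONS[resolution]
        slots = defaultdict(int)
        for t in self.search(uri):
            slots[t.strftime(pattern)] += 1
        return dict(sorted(slots.items()))


# Illustrative data only: three captures of an invented URL.
index = HarvestIndex()
index.add("http://example.org/", datetime(2004, 12, 1, 8, 0))
index.add("http://example.org/", datetime(2005, 11, 3, 9, 15))
index.add("http://example.org/", datetime(2005, 11, 3, 17, 40))

print(index.search("http://example.org/", start=datetime(2005, 1, 1)))
print(index.timeline("http://example.org/", "years"))
print(index.timeline("http://example.org/", "days"))
```

Full-text search is deliberately left out of the sketch. A collection at the domain scale would also need a persistent, sorted index rather than an in-memory dictionary.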
4 IIPC web site. http://netpreserve.org
5 Reference Model for an Open Archival Information System. Consultative Committee for Space Data Systems. Blue book, January 2002. http://public.ccsds.org/publications/archive/650x0b1.pdf
6 WARC, Web ARChive file format. http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
7 http://netpreserve.org/software/downloads.php
8 WERA (Web aRchive Access). http://nwa.nb.no/wera/index.php
WEB ARCHIVING AT BNF
BnF started its web archiving experiments as pilots for the extension of legal deposit legislation to online publications. Legal deposit is the legal obligation for every publisher, printer, producer, distributor or importer of documents to deposit copies of all published materials in the mandated institutions. Originally promulgated for printed books in 1537, legal deposit has been progressively extended to all types of materials of expression and creation, including new technologies as they appeared in France. After books, engravings, music scores, photographs, posters, audiovisual and multimedia documents, the time has come to archive Web sites as well⁹.
In 2006 the legislation was extended in two directions.
› The DADVSI Law (DADVSI stands for Droit d'auteur et droits voisins dans la société de l'information, loi 2006-961) was officially published on August 3rd, 2006. The BnF had waited a long time for this law, whose Title IV (Clauses 39 to 47) officially establishes Web legal deposit. All the collections created during the pilot phase since 2001 will be made accessible to authorized visitors of BnF in the reading rooms of the Research Library only. This restriction applies to all legal deposit collections. Access will be authorized only after publication of a specific decree, possibly in 2007.
› In June 2006 a modification of the previous legal deposit decree allowed BnF to negotiate with producers the deposit of electronic files in place of the previous classical medium (for instance paper).
The current archiving process is organised in three complementary collecting methods:
Bulk automatic harvesting of French national domain websites. BnF signed a 3-year agreement with Internet Archive (IA) in 2004 whereby both partners agreed to embark on a research project on the French national domain. Through this partnership, BnF has captured two snapshots of French domain sites, at the end of 2004 and 2005. Each snapshot contains 118 to 140 million files, equal to a volume of 7 terabytes. The third snapshot is in progress. The target is two snapshots per year. All the collections are indexed by URL and date of harvest, and the current one will in addition be full-text searchable.
Thematic focused harvesting of a selection of sites made by subject or reference librarians. Focus crawls can be thematic or event-based. About 3 500 websites from the 2002 and 2004 French elections have been archived (23 million files). The latest crawl (at the end of 2005) ran on 4 500 sites (40 million files, a third of which were blogs). The current one covers the presidential and general elections and runs from October 2006 to June 2007. The election collections are indexed by URL, date of harvest and subject metadata provided by curators at selection time. BnF is also carrying out a study of how students and scholars of the Institut d'études politiques de Paris use the 2002 and 2004 election collections.
Continuous crawl: the online edition of the Journal officiel de la République française (the Government's main publication) is harvested on a daily basis. The full collection of the title has been harvested since the first online issue in June 2004.
IMPACT OF LARGE SCALE DIGITAL COLLECTIONS ON PRESERVATION AND ACCESS POLICIES
The size of digital collections is currently changing drastically in our institutions. The key changes are produced on one side by the mass digitisation programmes resulting from the impetus given by initiatives like Google Library or the Open Content Alliance, which aim at making more classical contents available on the web, and on the other side by the introduction of web contents into our cultural heritage through more and more web archiving initiatives.
The mass digitisation of classical analogue collections will allow the digital surrogates to be used in place of the originals, not only for remote access but also locally in the library. This reduces the pressure of the classical preservation actions required for the analogue material most in demand and allows the corresponding resources to be transferred from preservation to digitisation. For instance, at BnF the Digitisation service was attached to the Preservation department in 2004 in order to facilitate this evolution.
The possibility the law offers BnF to choose between the electronic format and the printed output makes it urgent to manage the digital preservation of large-scale digital collections. The top priority for cultural heritage institutions is moving towards setting up trusted repositories that provide risk management as well as long-term preservation and access. In addition, the access applications have to improve and offer good search and navigation facilities, not only through the mediation of descriptive metadata but also through smart tools for content analysis and faceted navigation.
9 Legal deposit: five questions about Web Archiving at BnF. http://www.bnf.fr/pages/version_anglaise/depotleg/dl-internet_quest_eng.htm