Data mining and WAT files: format, tools and use cases
Access Working Group, 30th April 2015
Using WAT at the BnF to map the First World War
The WAT format: general presentation and possible developments
Tools for creating WAT files: different tools and how to use them
Use cases at Internet Archive and the British Library
How IIPC can work on data mining
Using WAT at the BnF to map the First World War: context
Research project: “The future of digitised heritage online: the
example of the Great War”
Labex “Les passés dans le présent” (Investissements d’avenir)
Examine the circulation of images online
Use of data mining approaches on the BnF web archives
Study of a discussion forum on WWI and its place in the map
of sites related to the war
Sites selected for the BnF crawl on the centenary of WWI
Two parallel approaches
Cartography and analysis
Télécom ParisTech: research engineer (Zeynep
Pehlivan) based at the BnF, working with social
science researchers (Valérie Beaudouin)
Development of tools (cartography, text mining…)
Creation of metadata and corpus extraction
Creation of WAT files at the BnF
Development of tools to extract corpora from WARC files
Aim to create a service for other research projects
Questions…
What metadata format do we use? Do we need to adapt the
WAT format?
What tools do we use? How do we include them in our
workflow?
How do we present metadata to researchers? What other
information is needed to make sense of the datasets?
What tools and technical environment are needed to analyse
the datasets?
How do we make a generic solution that can be used for
other projects?
Answers…(?)
Use WAT format: some minor changes
Use of IIPC Web Archive Commons tool: again some minor
changes
Information from selection tool (BCWeb), NetarchiveSuite,
crawl logs…
Questions on how to filter data and define nodes
Experiments with different tools: Python, MongoDB, d3.js,
Gephi
First results but still a work in progress
…still need to define the technical, legal and organisational
solution for creating a service for researchers
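One possible answer to the question of how to filter data and define nodes is to treat each host as a node and to count page-to-page links as weighted host-to-host edges. The sketch below assumes that choice; it is not necessarily the definition used at the BnF. The function names and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host of a URL, or None if it has none."""
    try:
        return urlsplit(url).hostname  # hostname is already lower-cased
    except ValueError:
        return None  # malformed URL, e.g. a broken IPv6 literal


def host_edges(links):
    """Aggregate (source_url, target_url) pairs into weighted host edges.

    Links that stay within one host, and links whose source or target
    has no host (mailto:, javascript:, ...), are filtered out.
    """
    edges = Counter()
    for source_url, target_url in links:
        source, target = host_of(source_url), host_of(target_url)
        if source and target and source != target:
            edges[(source, target)] += 1
    return edges


# Hypothetical sample: two forum threads linking to the same external site.
links = [
    ("http://forum.example.org/t/1", "http://www.example.com/a"),
    ("http://forum.example.org/t/2", "http://WWW.example.com/b"),
    ("http://forum.example.org/t/2", "http://forum.example.org/t/1"),
    ("http://forum.example.org/t/3", "mailto:someone@example.org"),
]
edges = host_edges(links)
```

Other node definitions (registered domain, site section, single page) would only change `host_of`; the aggregation stays the same.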
Framework
Sources: WAT, crawl logs, BCWeb
Storage and processing: Python, MongoDB
Request via user interface: Python, MongoDB
Visualisation: D3.js, or GEXF for Gephi / Tulip / Manylines etc.
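The GEXF step of the framework can be illustrated with a minimal export using only the Python standard library. This is a sketch under assumptions: the input is a hypothetical mapping of (source, target) host pairs to link counts, and only the basic GEXF 1.2 elements (nodes, edges, weight) are written.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def build_gexf(edges):
    """Build a GEXF 1.2 document from {(source, target): weight}."""
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "directed"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    edges_el = ET.SubElement(graph, "edges")

    # Every host that appears as a source or a target becomes one node.
    hosts = sorted({host for pair in edges for host in pair})
    for host in hosts:
        ET.SubElement(nodes_el, "node", {"id": host, "label": host})

    # One directed edge per host pair, weighted by the number of links.
    for index, ((source, target), weight) in enumerate(sorted(edges.items())):
        ET.SubElement(
            edges_el,
            "edge",
            {
                "id": str(index),
                "source": source,
                "target": target,
                "weight": str(weight),
            },
        )
    return ET.ElementTree(root)


# Hypothetical input: one weighted edge between two hosts.
tree = build_gexf({("forum.example.org", "www.example.com"): 2})
# To save it for Gephi (the file name is an assumption):
# tree.write("hosts.gexf", encoding="utf-8", xml_declaration=True)
```

The resulting file should open in Gephi or another GEXF reader; attributes such as crawl date or site category would need the GEXF attribute elements, which this sketch leaves out.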
Next steps
Include creation of WAT files in BnF workflows
Extraction tools to allow researchers to define a corpus
WATs can be used to identify links and embeds
Final decisions on tools and technologies to be used
Creation of a service for researchers
What is WAT?
WAT means Web Archive Transformation
Metadata format for structuring metadata generated by web
crawls
Developed by the Internet Archive
First presented at 2011 IIPC GA in The Hague
Purpose:
(W)ARC files are heavy
let’s extract everything useful once
WAT could be shareable
feedback wanted
WATs are WARCs
(W)ARC files store the raw crawl data
WAT files store metadata related to the data stored in WARC
files
WAT files are structured according to the WARC format and
consist of WARC records
WARC records contain metadata elements which are stored
in JSON
Specification:
https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification
BnF-13730-28-20150416093352-00001ciblee_2015_gulliver112.bnf.fr.warc.gz
BnF-13730-28-20150416093352-00001ciblee_2015_gulliver112.bnf.fr.warc.wat.gz
Metadata is organized into Container and Envelope blocks. All the
metadata fields are optional.
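Since every field is optional, code that reads the Envelope block has to tolerate missing keys. The sketch below pulls the source URL, links and embeds out of one parsed WAT record. The key names (`WARC-Header-Metadata`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`, and `path` values such as `A@/href`) are an assumption based on commonly seen WAT output and should be checked against the specification and the files actually produced. The split between links and embeds is also a simplification, and the sample record is hypothetical.

```python
def extract_links(wat_json):
    """Return (source_url, links, embeds) for one parsed WAT record."""
    envelope = wat_json.get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    links, embeds = [], []
    for entry in html.get("Links", []):
        url = entry.get("url")
        if not url:
            continue
        # "path" names the element and attribute the URL came from.
        path = entry.get("path", "")
        if path.startswith("A@"):
            links.append(url)  # hyperlink, e.g. "A@/href"
        elif path.endswith("/src"):
            embeds.append(url)  # embedded resource, e.g. "IMG@/src"
    return source, links, embeds


# Hypothetical record, reduced to the fields used above.
record = {
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://forum.example.org/t/1"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Links": [
                        {"path": "A@/href", "url": "http://www.example.com/a"},
                        {"path": "IMG@/src", "url": "http://img.example.net/p.jpg"},
                        {"path": "LINK@/href", "url": "/style.css"},
                    ]
                }
            }
        },
    },
    "Container": {},
}
source, links, embeds = extract_links(record)
```

A record without HTML metadata (an image, for example) simply returns empty lists, so the same function can be run over every record in a file.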
Tools for creating WAT files
IIPC webarchive-commons library
https://github.com/iipc/webarchive-commons
ResourceExtractor tool
java -classpath $CLASSPATH org.archive.extract.ResourceExtractor -wat input.warc.gz output.warc.wat.gz
Also works to create CDXs
Minor evolutions to ensure consistency with ISO standard
WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself (#42)
WAT extractor: WARC-Date in all records should be the WAT record generation date (#43)
WAT extractor: missing WARC format version (#45)
WAT extractor: envelope structure does not conform to the WAT specification (#44)
WAT extractor: adding information in WAT's warcinfo (#47)
WAT use cases
Internet Archive (Vinay Goel)
WAT for quality control and collection analysis
British Library (Andrew Jackson)
WATs as open access datasets
Discussion
Who uses WAT files? Who is planning to use WAT files?
As downloaded datasets (e.g. Archive-It ARS)? Creating own
WATs from (W)ARCs?
Any evolution needed in format or tools? Need for
documentation?
What role should the IIPC have? Helping with
standardization? Maintaining and providing documentation
on WAT and associated tools?
Use cases? How are WATs presented to researchers?
How can data mining questions be taken forward within IIPC
and the AWG?