Data mining and WAT files: format, tools and use cases
Access Working Group, 30th April 2015

- Using WAT at the BnF to map the First World War
- The WAT format: general presentation and possible developments
- Tools for creating WAT files: different tools and how to use them
- Use cases at Internet Archive and the British Library
- How IIPC can work on data mining

Using WAT at the BnF to map the First World War: context

- Research project: “The future of digitised heritage online: the example of the Great War”
- Labex “Les passés dans le présent” (Investissements d’avenir)
- Examine the circulation of images online
- Use of data mining approaches on the BnF web archives
- Study of a discussion forum on WWI and its place in the map of sites related to the war
- Sites selected for the BnF crawl on the centenary of WWI

Two parallel approaches

Cartography and analysis
- Télécom ParisTech: research engineer (Zeynep Pehlivan) based at the BnF, working with social science researchers (Valérie Beaudouin)
- Development of tools (cartography, text mining…)

Creation of metadata and corpus extraction
- Creation of WAT files at the BnF
- Development of tools to extract corpora (WARC files)
- Aim to create a service for other research projects

Questions…

- What metadata format do we use? Do we need to adapt the WAT format?
- What tools do we use? How do we include them in our workflow?
- How do we present metadata to researchers? What other information is needed to make sense of the datasets?
- What tools and technical environment are needed to analyse the datasets?
- How do we make a generic solution that can be used for other projects?

Answers…(?)
- Use WAT format: some minor changes
- Use of IIPC Web Archive Commons tool: again some minor changes
- Information from selection tool (BCWeb), NetarchiveSuite, crawl logs…
- Questions on how to filter data and define nodes
- Experiments with different tools: Python, MongoDB, d3.js, Gephi
- First results but still a work in progress
- …still need to define technical, legal and organisational solution to create a service for researchers

Framework

[Diagram. Inputs: WAT, crawl logs, BCWeb. Processing: Python, MongoDB. Request via user interface: Python, MongoDB. Visualisation: D3.js; GEXF for Gephi / Tulip / Manylines etc.]

Next steps

- Include creation of WAT files in BnF workflows
- Extraction tools to allow researchers to define a corpus
- WATs can be used to identify links and embeds
- Final decisions on tools and technologies to be used
- Creation of a service for researchers

What is WAT?

- WAT means Web Archive Transformation
- Metadata format for structuring metadata generated by web crawls
- Developed by the Internet Archive
- First presented at 2011 IIPC GA in The Hague
- Purpose:
  - (W)ARC files are heavy: let’s extract everything useful once
  - WAT could be shareable
- Feedback wanted

WATs are WARCs

- (W)ARC files store the raw crawl data
- WAT files store metadata related to the data stored in WARC files
- WAT files are structured according to the WARC format and consist of WARC records
- WARC records contain metadata elements which are stored in JSON
- Specification: https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification

BnF-13730-28-20150416093352-00001ciblee_2015_gulliver112.bnf.fr.warc.gz
BnF-13730-28-20150416093352-00001ciblee_2015_gulliver112.bnf.fr.warc.wat.gz

- Metadata is organized into Container and Envelope blocks.
- All the metadata fields are optional.
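
As a minimal sketch of how the JSON payload of a single WAT record might be read to identify links and embeds, the Python below walks the Envelope block and returns one edge per outgoing link. The field names used here (Envelope, WARC-Header-Metadata, WARC-Target-URI, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, Links, url, path) are assumptions about the layout written by the extractor and should be checked against the specification and against real files. The sample record, its example.org URLs and the helper names are invented for illustration. Because all metadata fields are optional, every lookup tolerates missing keys.

```python
import json


def _dig(obj, *keys):
    """Follow nested dictionary keys, returning None as soon as one is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def extract_links(wat_payload):
    """Return (source, target, path) tuples from one WAT record's JSON payload.

    Assumed layout (not confirmed by the slides):
    Envelope -> Payload-Metadata -> HTTP-Response-Metadata -> HTML-Metadata -> Links
    """
    record = json.loads(wat_payload)
    source = _dig(record, "Envelope", "WARC-Header-Metadata", "WARC-Target-URI")
    links = _dig(record, "Envelope", "Payload-Metadata",
                 "HTTP-Response-Metadata", "HTML-Metadata", "Links") or []
    edges = []
    for link in links:
        # Skip malformed entries: every field is optional.
        if isinstance(link, dict) and link.get("url"):
            edges.append((source, link["url"], link.get("path")))
    return edges


# Invented sample payload, for illustration only.
sample = json.dumps({
    "Container": {"Filename": "example.warc.gz"},
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.org/"},
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
            "Links": [
                {"path": "A@/href", "url": "http://example.org/a"},
                {"path": "IMG@/src", "url": "http://example.org/img.png"},
                {"path": "A@/href"},  # no url: ignored
            ]
        }}},
    },
})

for source, target, path in extract_links(sample):
    print(source, "->", target, "(" + str(path) + ")")
```

Edges of this shape could then be loaded into a store such as MongoDB or exported to a graph format for tools like Gephi, in line with the framework outlined above. How nodes are defined (page, host or site) remains an open choice.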
Tools for creating WAT files

- IIPC webarchive-commons library: https://github.com/iipc/webarchive-commons
- ResourceExtractor tool:
  java -classpath $CLASSPATH org.archive.extract.ResourceExtractor -wat input.warc.gz output.warc.wat.gz
- Also works to create CDXs
- Minor evolutions to ensure consistency with ISO standard:
  - WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself (#42)
  - WAT extractor: WARC-Date in all records should be the WAT record generation date (#43)
  - WAT extractor: missing WARC format version (#45)
  - WAT extractor: envelope structure does not conform to the WAT specification (#44)
  - WAT extractor: adding information in WAT's warcinfo (#47)

WAT use cases

- Internet Archive (Vinay Goel): WAT for quality control and collection analysis
- British Library (Andrew Jackson): WATs as open access datasets

Discussion

- Who uses WAT files? Who is planning to use WAT files?
  - As downloaded datasets (e.g. Archive-It ARS)?
  - Creating own WATs from (W)ARCs?
- Any evolution needed in format or tools? Need for documentation?
- What role should the IIPC have?
  - Helping with standardization?
  - Maintaining and providing documentation on WAT and associated tools?
  - Use cases?
- How are WATs presented to researchers?
- How can data mining questions be taken forward within IIPC and the AWG?