If These Crawls Could Talk: Studying and Documenting Web Archives Provenance

Emily Maemura, PhD Candidate, Faculty of Information, University of Toronto
NetLab Forum, February 27, 2018

The Team
Nich Worby: Librarian, Web Archivist
Christoph Becker: Assistant Professor; Director, Digital Curation Institute
Ian Milligan: Associate Professor, Digital Historian
Emily Maemura: Doctoral Candidate
Brought together through the Digital Curation Institute McLuhan Centenary Fellowship in Digital Sustainability 2016-17 and supported by SSHRC

Number of Pages per Domain, Canadian Political Parties Collection, 2005-2015 http://lintool.github.io/warcbase/vis/crawl-sites/

Pages from policyalternatives.ca
Sept. 2007 - Nov. 2009: over 200,000 pages per crawl
Dec. 2009 - May 2015: fewer than 50,000 pages per crawl
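A per-domain page count like the one plotted above can be sketched in a few lines. This is a minimal illustration, not the Warcbase implementation: the input shape (pairs of crawl date and URL), the function name `pages_per_domain`, and the sample records are all assumptions made up for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def pages_per_domain(records):
    """Count captured pages per (crawl, domain) pair.

    `records` is an iterable of (crawl_date, url) tuples. This input
    shape is a hypothetical stand-in for whatever the crawl index provides.
    """
    counts = Counter()
    for crawl_date, url in records:
        domain = urlsplit(url).hostname or ""
        # Treat www.example.ca and example.ca as the same domain.
        if domain.startswith("www."):
            domain = domain[4:]
        counts[(crawl_date, domain)] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    ("2007-09", "http://www.policyalternatives.ca/reports/a"),
    ("2007-09", "http://policyalternatives.ca/reports/b"),
    ("2009-12", "http://www.policyalternatives.ca/"),
]
counts = pages_per_domain(sample)
```

Comparing the counts for one domain across crawls is what surfaces a sudden drop such as the one on this slide, which then prompts the question of what changed in the crawl setup.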

How are web archives made and used? How can we document or communicate this?
Creating web archives. As a web archivist: What do I need to document for researchers using web archives material?
Using web archives. As a researcher: What do I need to ask about this data to have confidence in the analysis and findings?

Researcher Perspective: Web Archives Research Objects
Maemura, Milligan, Becker, 2016. Understanding computational web archives research methods using research objects, Computational Archival Science Workshop at IEEE Big Data 2016. http://hdl.handle.net/1807/74866
How are web archives used by researchers?
Can shared conceptual frameworks of research methods used with Web Archives collections help to systematize practices, advance the field, and make it easier to introduce new researchers to the area?

Research Objects in Computational Science Bechhofer et al., 2013, Why Linked Data is Not Enough for Scientists, Future Generation Computer Systems 29(2), February 2013, Pages 599-611 www.researchobject.org

Our Approach
Adopting the Research Object (RO) Framework for Research with Web Archives:
to structure the complex aggregation of computational processes, services, forms of data, contexts, and approaches
balancing systematic approaches from computational sciences with humanistic issues of provenance and trust
Developed here to characterize three cases of web archives research studies, examples completed by co-author Ian Milligan

Research Objects as a Conceptual Framework
ORGANIZATIONAL CONTEXT: ethical and governance approvals, investigators, etc.; acknowledgements
QUESTIONS: state a problem and/or hypothesis
STUDY DESIGN: scope, rationale for choices in tools, sources, methodology
DATA: materials studied, those taken as inputs to processes
METHODS: the tools, workflows, scripts, processes, settings, configurations used to perform the analysis
RESULTS: materials produced (derived datasets, visualizations, etc.)
ANSWERS: publications and presentations

Case: GeoCities Community
Milligan, I. (2017). Welcome to the Web: The Online Community of GeoCities and the Early Years of the World Wide Web. In R. Schroeder & N. Brügger (Eds.), The Web as History.
ORGANIZATIONAL CONTEXT
sole-authored research by a historian
no ethics approval since data is publicly available via Wayback Machine
research agreement signed with Internet Archive limits sharing and publication of specific derivative datasets

Case: GeoCities Community
QUESTIONS
Research Question: Did GeoCities users have a sense of community?
also understanding what tools and approaches historians need to study the web
STUDY DESIGN
iterative approach, exploratory investigation of archived GeoCities.com data
development of Warcbase analytics platform

Case: GeoCities Community
Analysis Phase 1 (2013)
DATA: data from Archive Team torrent (wget geocities.com), ~1TB
METHODS: data prep with bash scripts, html2text.py; analysis with Mathematica scripts, bash (on sample)
Analysis Phase 2 (2016)
DATA: data from Internet Archive end-of-life crawl, ~4TB
METHODS: data prep and selection done within warcbase; analysis of raw link structures; popular images ordered by MD5 hash
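The "popular images ordered by MD5 hash" step can be illustrated with a short sketch: hashing each image payload groups byte-identical copies regardless of URL, so the most frequent hashes identify the most reused images. This is not the Warcbase code used in the study; the function name `rank_images_by_hash`, the input shape, and the sample payloads are hypothetical.

```python
import hashlib
from collections import Counter


def rank_images_by_hash(images):
    """Rank image payloads by how often identical bytes recur.

    `images` is an iterable of (url, payload_bytes) pairs (a hypothetical
    input shape). Returns (md5_hex, count) pairs, most frequent first.
    """
    counts = Counter(
        hashlib.md5(payload).hexdigest() for _url, payload in images
    )
    return counts.most_common()


# Made-up payloads standing in for image bytes extracted from an archive.
sample = [
    ("http://example.org/a/new.gif", b"GIF89a-new"),
    ("http://example.org/b/new.gif", b"GIF89a-new"),
    ("http://example.org/c/construction.gif", b"GIF89a-construction"),
]
ranking = rank_images_by_hash(sample)
```

Here MD5 serves only as a fingerprint for exact duplicates; images that were resized or re-encoded would hash differently and be counted separately.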

An Initial Profile for Web Archives Research Objects
ORGANIZATIONAL CONTEXT: Disciplinary perspectives of researchers, roles in large teams and partnerships. Legal agreements and contracts that impact data sharing and use.
QUESTIONS: Motivations for the study and research contributions. For whom is the work relevant?
STUDY DESIGN: Designs vary widely by discipline and conceptual perspectives. Include rationale for selecting sources, methods, scope.
DATA: How was data sourced? Collection timeframe(s), formats and interoperability.
METHODS: Which scripts, services, software packages were used? Was code published?
RESULTS: Are results FAIR, published, citable? How were they validated?
ANSWERS: Interpretation of results is an important part of humanities scholarship, described in publication of findings (not simply the results of workflows).

Template and Workshop at RESAW 2017, London
ORGANIZATIONAL CONTEXT: Discipline(s) of Research Team; Funding; Approvals; Partnerships; Agreements & Contracts; Acknowledgements
QUESTIONS: Research questions and motivations for the study
STUDY DESIGN: Scope (limits of time period, geographic area, etc.); Rationale for choices of tools, sources, methods
ANSWERS: Publications and Presentations of findings
DATA, METHODS & RESULTS
Data Sources: metadata, method of collection
Data Preparation: workflows and derived datasets
Data Analysis: methods, config settings, logs
Results Generated: published figures, data, code

Template and Workshop at RESAW 2017, London (continued)
Data Sources (metadata, method of collection): Where do these come from, how are they made?

Web Archivist Perspective: Elements of Provenance
Maemura, Worby, Milligan, Becker, 2017. Origin Stories: documentation for web archives provenance, Web Archiving Week / IIPC Web Archiving Conference.
How are web archives made?
Which individual decisions are made in web archives practices, and how can we understand and communicate the impacts on resulting collections?

Our Approach
Starting with the Web Archiving Life Cycle Model (Bragg et al. 2013) as a framework: What decisions are made at each life cycle phase? (e.g. Appraisal and selection; Scoping; Data Capture; Storage and Organization; Quality Assurance and Analysis)
Studying the process of using Archive-It to create three web archives collections by University of Toronto Libraries (UTL)
http://ait.blog.archive.org/files/2014/04/archiveit_life_cycle_model.pdf

Overview of the Collections Studied

Canadian Political Parties
Collection timeframe: October 2005 to present (ongoing)
Crawl frequency: Quarterly (every 3 months)
Crawl duration: 3 days
# of Active Seeds: 62
Total data archived*: >900 GB, >29,000,000 documents
Crawl limits and rules specified: Ignore robots.txt

Pan Am Games
Collection timeframe: February 2015 to December 2016
Crawl frequency: Combination of Daily, Weekly, Monthly and One-time crawls
Crawl duration: Varies widely by crawl (from hours to days)
# of Active Seeds: 434
Total data archived*: >100 GB, >3,500,000 documents
Crawl limits and rules specified: Ignore robots.txt; Block twitter.com URLs for lang=?

Global Summitry
Collection timeframe: June 2016 to present (ongoing)
Crawl frequency: TBD, based on timing of summit events
Crawl duration: 5 days (currently test crawls only)
# of Active Seeds: 167
Total data archived*: >400 GB, >5,000,000 documents
Crawl limits and rules specified: Ignore robots.txt

General Workflow for UTL crawls
Importance of iterations, different types of crawls: test crawls, production crawls, patch crawls
Managing organizational data budget across multiple collections

Three Key Findings
1) Scoping: decisions are made throughout
2) Process: unforeseen issues arise during a crawl; the actions taken to resolve these issues need to be documented
3) Context: individual decisions interact and are influenced by changes in organizational context and wider environment, impacting the collection over time

Interdependencies and interactions between factors
Is a site with robots.txt exclusions captured?
Does the technical system allow? Yes, option available in Archive-It, since 2010
Does the legal environment permit? Yes, after Copyright Law Amendment, 2013
Does organizational policy guide action? Yes, new law interpreted in Permissions Policy, 2014
Individual curatorial choice to exclude sites with robots.txt for a particular collection or crawl
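One way to read the interaction above is as a conjunction: a site with robots.txt exclusions is captured only when the technical, legal, and policy conditions all hold and the curator has not chosen to exclude it. The sketch below encodes that reading. The class and field names are hypothetical, and the logic is an interpretation of the slide rather than a rule stated by the authors or by Archive-It.

```python
from dataclasses import dataclass


@dataclass
class RobotsDecisionContext:
    """Factors from the slide; all field names are hypothetical."""
    technical_option_available: bool      # crawler offers an "ignore robots.txt" option
    legally_permitted: bool               # legal environment permits the capture
    policy_supports: bool                 # organizational policy guides the action
    curator_excludes_robots_sites: bool   # per-collection or per-crawl choice


def captures_robots_excluded_site(ctx: RobotsDecisionContext) -> bool:
    """Return True if a site with robots.txt exclusions would be captured."""
    enabling = (
        ctx.technical_option_available
        and ctx.legally_permitted
        and ctx.policy_supports
    )
    # The curatorial choice can still rule the site out for a given crawl.
    return enabling and not ctx.curator_excludes_robots_sites


# Conditions as the slide describes them from 2014 onward, with no curatorial exclusion.
after_2014 = RobotsDecisionContext(True, True, True, False)
# Option available in the tool, but before the legal and policy changes.
before_2013 = RobotsDecisionContext(True, False, False, False)
```

Recording the value of each factor alongside a crawl is what lets a later researcher tell whether a missing site reflects a technical limit, a legal or policy constraint, or a curatorial decision.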

Elements of Scoping, Process, Context to Document
Element: Key Questions and Information to Document
Motivation: What is the purpose of the collection? Has its mandate changed over time?
Focus: Which geographic, temporal, technical, political, topical and/or social boundaries are defined to scope the collection?
Access & Discovery: Who is the intended audience? Do they have known characteristics or needs? Which contractual, organizational, legal, or other agreements restrict access? What metadata fields and indexes support discovery? At what degree of granularity (by collection, site, or individual resource)? Which data formats or derivative datasets are available?
Seed list: What seeds were used in the scoping of the collection? What was the process of discovering and selecting seeds?
Crawl timing: What is the frequency of crawls? How long do crawls run or what time limit is set?
Crawl configuration: What settings control the depth of a crawl? For example, settings for capture by distance from original seed. Is the goal to have a more comprehensive or a breadth-focused collection?
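The elements above could be kept as a simple provenance record per collection, with a check for elements that have not yet been documented. The sketch below is illustrative only: the dictionary layout and the `missing_elements` helper are assumptions, and the two filled-in values are taken from the Canadian Political Parties column of the overview slide.

```python
# Element names as listed on the slide.
ELEMENTS = [
    "Motivation",
    "Focus",
    "Access & Discovery",
    "Seed list",
    "Crawl timing",
    "Crawl configuration",
]


def missing_elements(record):
    """Return the elements that are absent or blank in a provenance record."""
    return [
        name for name in ELEMENTS
        if not str(record.get(name, "")).strip()
    ]


# Partially documented record (hypothetical layout).
record = {
    "Crawl timing": "Quarterly crawls, 3 days each",
    "Crawl configuration": "Ignore robots.txt",
}
todo = missing_elements(record)
```

A check like this only confirms that something was written for each element; whether the answer is adequate for a researcher's needs remains a judgment call.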

Next Steps - my work in Aarhus

Research Questions
How are web archives made and used across different contexts?
How can we document or communicate this in systematic and comprehensive ways?
+ How can this be supported by web archives research infrastructure?

Studying work with Netarkivet
Creating web archives. As a web archivist: What do I need to document for researchers using web archives material?
Using web archives. As a researcher: What do I need to ask about this data to have confidence in the analysis and findings?
Seeking specific projects and examples of connections between researchers and archivists at NetLab and the Royal Library

Approach and Methods
Initial Exploratory Work:
Interviews and observation of work practices
Understanding workflows, creating diagrams, timelines
Identifying current needs and challenges
In-depth Analysis:
More targeted discussions and interviews focused on particular data or artifacts
Digging into specific examples, connecting data traces to decisions and context of decision space

Anticipated Results and Outcomes
Refine, revise, expand framework
Report findings at NetLab Forum in May
Compare and contrast with second case focused on work in Canada with web archiving and research (Summer/Fall 2018)
Put it all together in a dissertation (!): Logics of Order and Aggregation in Web Archiving Systems (working title)

Thank You!
Emily Maemura, PhD Candidate, Faculty of Information, University of Toronto
e.maemura@mail.utoronto.ca
This research is supported by the Social Sciences and Humanities Research Council Joseph-Armand Bombardier Canada Graduate Scholarship and the Michael Smith Foreign Study Supplement