Scholarly Requirements for Large Scale Text Analysis


A User Needs Assessment for the HathiTrust Research Center
Harriett Green, Eleanor Dickson, and Angela Courtney
Charleston Library Conference, November 5, 2016

What is the HathiTrust Research Center?
http://www.hathitrust.org/htrc
Jointly led by the University of Illinois at Urbana-Champaign and Indiana University Bloomington
Facilitates text analysis of HTDL content
Focus on large-scale, computational research
Research & Development
    Finding technical solutions
    Building tools and services
    Conducting user studies

Scholarly Practices with Digital Collections and Tools
How humanities scholars use digital collections: Brockman et al., 2001; Palmer and Neumann, 2002; Babeu, 2011; Rutner and Schonfeld, 2011; Green and Courtney, 2015
How humanities scholars use digital tools: Frischer et al., 2006; Warwick, 2008; Toms and O'Brien, 2008; Gibbs and Owen, 2012
Tools and resources for textual analysis: ARTFL and PhiloLogic (Argamon et al., 2009; Horton et al., 2009), MONK (Unsworth, 2011), WordSeer (Muralidharan and Hearst, 2013), Voyant and TaPOR (Rockwell et al., 2010), and Lexos (LeBlanc et al., 2013)

Workset Creation for Scholarly Analysis
Mellon Foundation-funded research project, 2013-2014
GOAL: Find out how researchers gather digital materials and build textual corpora for research purposes.
Findings (Green et al., 2014; Fenlon et al., 2014): researchers need
    The ability to create and manipulate collections as reusable datasets and research products
    The ability to work at different units of analysis
    Access to highly enriched metadata

HTRC User Requirements Study: Research Goals
Learn how researchers use digitized textual corpora, apply relevant methods and approaches, and seek needed tools
Develop illustrative use cases of text analysis research that will help shape the development and expansion of HTRC research services and training curricula for scholars
Obtain information that can inform development of text analysis data providers and research services

HTRC User Requirements Study: Methods
Recruited interviewees from professional conferences and meetings on digital libraries and digital humanities
Semi-structured interviews with 15 scholars
All interviews transcribed and coded in ATLAS.ti 7 by HTRC Scholarly Commons members
Qualitative content analysis

Findings
What are scholars' needs and practices for conducting textual analysis with large text archives?
Research Practices and Culture
Textual Data Acquisition and Use
Generating and Sharing Results

Research Practices and Culture

"When I say people have been studying this time period for 300 years, people who are much smarter than me, better writers, have better access to the archives, who can read more than I can, the only way we can say something new is if we get new perspective on old data."

"I'm not worried about publishing venues, I'm not worried about reproducibility, I'm not worried about statistics. My own knowledge of that is pretty good. But the collaborative work style is really hard."

"I wish more people were archiving their data and their algorithms from the source code, as you see CS papers that will benchmark results against a dataset. Then how do you try to replicate or beat those results? It becomes impossible to evaluate your own methods against theirs and really slows down the pace of research, because if one could surpass state of the art, then that's an application and [a] step forward."

Data Acquisition and Use

"We need that corpus, and we need basically the data exposable. I mean we have to be able to view, not only manipulate, but also view the data."

"The newspaper archives that exist... all have their own siloed, siphoned search system and metadata collection mechanisms, and these things do not necessarily always talk to each other, try as I will to get everything into Zotero in some kind of unified form."

"I did work with ProQuest and the New York Times. I had an article that came out in an academic journal and I worked with them to get permission to use an image before, so I know that it is possible. But the process was so long and it was for three images. I'm going to have thousands of files."

Where Do They Get Data and How Do They Analyze It?
Data Sources: HathiTrust Digital Library, Internet Archive, self-digitized content, Early English Books Online, JSTOR, ARTFL, Google Scholar, English Short Title Catalog, Project Gutenberg, ProQuest Historical Newspapers
Tools: Voyant, Juxta, Gephi, Tableau, MorphAdorner, D3.js, MALLET, Zotero, Weka, Python (NLTK, scikit-learn)

Generating and Sharing Results

"My general like philosophical approach to these things is I like to do things small. I build my corpora. I like to read them myself. I'm a little weary of like big distant reading approaches, especially with stuff as far away from the present as my stuff. So I'm still trying to perfect the stuff that I'm currently doing."

"Document navigation would be extremely helpful, and that's the kind of thing that people have to do a lot of: searching, bookmarking, grouping things, and looking at several segments together."

"In some ways GitHub is an integral part of this. We can try to describe this code, or you can go look at our code, so it's interesting in that if you read the paper without actually looking at the code, you've gotten sort of a broad overview of the method, but you couldn't replicate it. So it's a bit of a hybrid publication."

Findings: User Personas

Digital Project Librarian
    Wants: flexible, transparent tools
    Role: research support staff
    Challenges: inaccessible data, matching tool to researcher

Faculty Member
    Wants: computational resources
    Role: experienced researcher
    Challenges: collaboration, finding texts

Graduate Student
    Wants: examples
    Role: new researcher
    Challenges: understanding stats, choosing areas of interest

Credits: Alex Kinnaman, Peter Organisciak, Eleanor Dickson

Looking Forward
Expanding the study with interviews with social scientists
IMLS-funded "Digging Deeper, Reaching Further: Libraries Empowering Users to Mine the HathiTrust Digital Library": http://teach.htrc.illinois.edu
HT + Bookworm development (NEH)
Data Capsule development (Mellon)
Revision to HTRC Portal and Workset Builder

Interested in working with HTRC?
http://www.hathitrust.org/htrc
HTRC Announcements: htrc-announce-l@list.indiana.edu
Questions? htrc-help@hathitrust.org
HTRC User Group: htrc-usergroup-l@list.indiana.edu
Advanced Collaborative Support program: htrc.acs.awards@gmail.com

HTRC Scholarly Commons Team
University of Illinois: Beth Sandore Namachchivaya, Stephen Downie, Harriett Green, Eleanor Dickson, Peter Organisciak, Alex Kinnaman (graduate assistant), Sayan Bhattacharyya (former CLIR postdoc)
Indiana University: Robert McDonald, Angela Courtney, Nicholae Cline, Leanne Nay, Ewa Zegler-Poleska

Thank you!
Harriett Green, green19@illinois.edu, @greenharr
Eleanor Dickson, dicksone@illinois.edu, @elliedickson
Angela Courtney, ancourtn@indiana.edu, @englishlitlib