Social media corpora, datasets and tools: An overview


Darja Fišer, Director for User Involvement, CLARIN ERIC, Darja.Fiser@ff.uni-lj.si
Jakob Lenardič, Assistant to Director for User Involvement, CLARIN ERIC, jakob.lenardic@ff.uni-lj.si
CLARIN-PLUS workshop "Creation and Use of Social Media Resources"
Kaunas, Lithuania, 18-19 May 2017

CLARIN in five bullets
CLARIN is the Common Language Resources and Technology Infrastructure
- that provides easy and sustainable access for scholars in the humanities and social sciences and beyond
- to digital language data (in written, spoken, video or multimodal form),
- and advanced tools to discover, explore, exploit, annotate, analyse or combine them,
- wherever they are located
- through a single sign-on online environment.

CLARIN ERIC: 19 members, 2 observers, 1 associated partner

Prehistory of this workshop
- H2020 project CLARIN-PLUS: outreach to new users, focus on four specific data types
  - oral history collections
  - newspaper archives
  - parliamentary records
  - social media data
- Joint proposals and research collaboration
  - national initiatives
  - R&D proposals (FP7, H2020, COST)
  - international projects

Long-term vision
- easy access to social media material
- services suited for this type of data can easily be found and employed
- encourage researchers to develop and address discipline-specific hypotheses and scholarly questions

Challenges and multidisciplinary potential
Social media data sets are considered a rich data type that
- is suited for both close reading and distant reading
- often presents itself as messy or noisy data
- calls for links with data in modalities other than text
- is produced under specific circumstances that need to be well understood before strong conclusions can be drawn
Social media data sets have great potential for reuse and re-purposing within many fields of study in the humanities and social sciences (and beyond):
- Humanities: language variation and change, discourse analysis, ...
- Social sciences: social and cultural dynamics, political sciences, economics, ...

Aims for this workshop
- explore existing and envisioned approaches for analyzing social media records (text, multimedia) with the use of CLARIN-compatible standards and processing tools
- generate an overview of relevant resources to stimulate synergy and cross-country collaboration
- create an action plan

Corpora
- Surveyed for all CLARIN members
- 15 corpora identified
- 12 different languages (de, uk, ee, fi, nl, lt, cy, fr, no, pl, it, si)
- most common data types: forums, blogs, tweets
- 4 available through concordancer, 4 for download, 2 both download and concordancer, 5 unavailable
- License info available for 7 corpora
  - 4 under CC BY: Suomi24 Sentence Corpus, Dortmund Chat Corpus, CoMeRe, DWDS-Blogs
  - Re-licensing forbidden: Monitor Corpus of Austrian Tweets
  - ACA_CLARIN-LT_End-User-Licence-Agreement_EN-LT: LITIS
  - Restricted in accordance with Twitter Terms of Use: Corpus of Welsh Language Tweets
- Most for German (4 corpora)
- 6 corpora available through the CLARIN infrastructure
  - ee: Mixed Corpus: New Media (CLARIN ESTONIA)
  - fi: Suomi24 (VLO)
  - lt: LITIS (VLO)
  - nl: SoNaR New Media Corpus (VLO)
  - no: NTAP (CLARINO)
  - de: Dortmund Chat Corpus (VLO)

Overview of corpora (1/2)
Lang | Name of corpus | Data types | Size | Period | Anno | Avail | Found
German | Dortmund Chat Corpus | Chats | 1,06m | / | T,P,L | / | VLO
German | DEREKO subcorpus | News & German Wiki | 670m | / | T | D,C | P.C.
German, English | Monitor corpus of tweets from Austrian users | Tweets | 30-40m | 2007-2017 | T,L | / | Google
German | DWDS subcorpus Blogs | Blogs | 102m | / | / | C | P.C.
Estonian | Mixed Corpus: New Media | Forums, chats, comments | 25m | 2000-2008 | T | D,C | CLARIN Estonia
Finnish | Suomi 24 | Forums | 2,600m | 2001-2016 | T,P | C | VLO
Lithuanian | LITIS v.1 | News | 190k cmnts | 2010-2014 | / | D | VLO
Dutch | SoNaR New Media Corpus | Tweets, chats, SMS | 35m | 2005-2012 | T,P,L | C | VLO
Dutch | Flemish online teenage talk | Facebook, Whatsapp | 2.9m | 2015-2016 | T | / | Survey
Welsh | Corpus of Welsh Language Tweets | Tweets | 7m tweets | / | / | D | P.C.
Norwegian, English, French | NTAP climate change blog corpora | Blogs related to climate change | 5,000m | 2000-2014 | T | C | CLARINO
Polish | Corpus Highly Emotive | Tweets | 160m | / | T | D | Google

Overview of corpora (2/2)
Lang | Name | Data types | Size | Period | Anno | Avail | Found
French | CoMeRe Repository | Emails, forums, chats, tweets, Wiki, etc. | 75-80m | Various | / | D | P.C.
Italian | Web2Corpus_it | Forums, Blogs, Newsgroups, social networks, chats | / | / | T,P,L | / | P.C.
Slovenian | JANES | Slovene CMC | 200m | 2013-2016 | T,P,L | / | CLARIN.SI
- Italian Web2Corpus_it and Slovene JANES are still in preparation
- Monitor corpus of tweets from Austrian users and Flemish Online Teenage Talk are unavailable.
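The availability figures on the Corpora slide can be cross-checked against the two overview tables. The following is a minimal sketch, assuming that "D" stands for download, "C" for concordancer and "/" for unavailable (the slides do not spell out these codes); the `CORPORA` list and the `tally_availability` helper are illustrative names, not part of any CLARIN tool.

```python
from collections import Counter

# (corpus name, Avail code) as read from the two overview slides.
# Assumed meaning of the codes: "D" = download, "C" = concordancer,
# "D,C" = both, "/" = not available.
CORPORA = [
    ("Dortmund Chat Corpus", "/"),
    ("DEREKO subcorpus", "D,C"),
    ("Monitor corpus of tweets from Austrian users", "/"),
    ("DWDS subcorpus Blogs", "C"),
    ("Mixed Corpus: New Media", "D,C"),
    ("Suomi 24", "C"),
    ("LITIS v.1", "D"),
    ("SoNaR New Media Corpus", "C"),
    ("Flemish online teenage talk", "/"),
    ("Corpus of Welsh Language Tweets", "D"),
    ("NTAP climate change blog corpora", "C"),
    ("Corpus Highly Emotive (Polish tweets)", "D"),
    ("CoMeRe Repository", "D"),
    ("Web2Corpus_it", "/"),
    ("JANES", "/"),
]

LABELS = {"C": "concordancer", "D": "download", "D,C": "both", "/": "unavailable"}


def tally_availability(corpora):
    """Count corpora per availability category."""
    return Counter(LABELS[code.replace(" ", "")] for _, code in corpora)


if __name__ == "__main__":
    for label, count in sorted(tally_availability(CORPORA).items()):
        print(f"{label}: {count}")
```

Under these assumptions the tally matches the summary on the Corpora slide: 4 through concordancer, 4 for download, 2 both, 5 unavailable.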

Problems
- Missing metadata
  - Unknown temporal span for DWDS, DEREKO, Dortmund Chat, Corpus of Welsh Language Tweets
  - Unknown annotation process for DWDS, LITIS, Corpus of Welsh Language Tweets
- Licence info
  - Unclear for most of the surveyed corpora
CLARIN Annual Conference 2016 SAB, Darja Fišer

Datasets
- 17 datasets identified
- by language:
  - 9 different languages (cz, dk, el, de, it, es, se, si, uk)
  - 1 multilingual
  - most for Slovene (6), English (3) and Italian (3)
- by data type: Tweets (10), Facebook comments (2), mixed (3), blogs (1), Reddit (1)
- by task: sentiment analysis (5), NER (1), entity linking (1), rest miscellaneous
- 8 of these integrated in the CLARIN infrastructure

Tools
Within the CLARIN infrastructure:
- GATE tools (CLARIN-UK)
- JANES tools (Clarin.si)
Elsewhere:
- Hunaccent (Hungarian): accentizer of Hungarian text
- Twython (language-independent): Python wrapper for the Twitter API
- dmi-tcat: a set of tools to retrieve and collect tweets from Twitter for statistical analysis
- Tweet NLP: a tokenizer, a part-of-speech tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools
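To illustrate why tweets call for a dedicated tokenizer, here is a minimal sketch using only Python's standard `re` module. It is not the algorithm used by Tweet NLP or by any other tool listed above; the pattern and the `tokenize_tweet` name are assumptions made for illustration. It shows one simple way to keep mentions, hashtags, URLs and a few emoticons as single tokens.

```python
import re

# Order matters: earlier alternatives win, so URLs, mentions and hashtags
# are matched as whole tokens before the generic word and symbol rules.
TOKEN_RE = re.compile(
    r"""
      https?://\S+            # URLs
    | @\w+                    # user mentions
    | \#\w+                   # hashtags
    | [:;=][-']?[)(DPp/\\]    # a few common Western emoticons
    | \w+(?:['’-]\w+)*        # words, with internal apostrophes or hyphens
    | [^\w\s]                 # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize_tweet(text):
    """Split a tweet into tokens, keeping tweet-specific items intact."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize_tweet("@clarin Loving the #CMC workshop :) http://t.co/abc !!"))
```

A generic whitespace-and-punctuation splitter would break `#CMC`, `:)` and the URL into several pieces, which is the kind of noise that tweet-specific tools are designed to handle.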

Lessons learned
- User ambitions tend to be conservative, so...
  ... a bit of technology push can be good, but...
  ... the functionality that tools have to offer should support users in the workflows they know, rather than steer the exploration of data or the application of tools in ways that are not understood, so user needs should be kept in focus.
- Scholarly insights and conclusions without modes for validating and/or replicating the results have difficulty gaining trust, so black boxes have little added value.
- For collaboration across disciplinary boundaries, communication pitfalls will never cease to exist, so keep talking after this workshop!

CLARIN: Infrastructural support for the study and use of language as social and cultural data
darja.fiser@ff.uni-lj.si
CLARIN-PLUS workshop "Creation and Use of Social Media Resources"
Kaunas, Lithuania, 18-19 May 2017