The Enriched TreeTagger System

H. Schmid, M. Baroni, E. Zanchetta, A. Stein
Universities of Stuttgart, Trento and Bologna (Forlì)
Evalita Workshop, Roma, September 10, 2007

TreeTagger

- Developed at the University of Stuttgart over a decade ago by Helmut Schmid
- Markov-model based
- Publicly available (http://www.ims.uni-stuttgart.de/projekte/corplex/treetagger)
- Supports multiple languages (Bulgarian, Dutch, English, (old) French, German, Italian, Spanish, Russian)
- Rather fast (50,000 tokens/second on a typical laptop)

Morph-it!

- Developed 4 years ago at the University of Bologna (Forlì) by Marco Baroni and Eros Zanchetta
- Freely available (Creative Commons and LGPL) (http://sslmit.unibo.it/morphit)
- 500,000+ word forms and 30,000+ lemmas
- Human-readable list of word forms with their lemmas and features

Form               Lemma              Features
rimpinzeremmo      rimpinzare         VER:cond+pre+1+p
abominevole        abominevole        ADJ:pos+m+s
dabbenaggine       dabbenaggine       NOUN-F:s
ostensibilmente    ostensibilmente    ADV
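The entry layout above can be read mechanically. The following is a minimal sketch, assuming whitespace-separated columns and assuming that the third column has the shape `POS:feature+feature+...`, which is inferred only from the four sample rows. The names `Entry` and `parse_entry` are hypothetical and not part of Morph-it!.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    form: str
    lemma: str
    pos: str
    features: Tuple[str, ...]


def parse_entry(line: str) -> Entry:
    """Split one lexicon line into form, lemma, POS and feature list.

    Assumption: columns are separated by whitespace and the tag looks
    like 'POS:feat+feat' (features may be absent, as for 'ADV').
    """
    form, lemma, tag = line.split()
    pos, _, feats = tag.partition(":")
    features = tuple(feats.split("+")) if feats else ()
    return Entry(form, lemma, pos, features)


# The four sample rows shown on the slide.
SAMPLE = """\
rimpinzeremmo rimpinzare VER:cond+pre+1+p
abominevole abominevole ADJ:pos+m+s
dabbenaggine dabbenaggine NOUN-F:s
ostensibilmente ostensibilmente ADV
"""

entries = [parse_entry(line) for line in SAMPLE.splitlines()]
```

Keeping the features as a tuple rather than a single string makes it easy to filter entries, for example all forms carrying the feature `p`.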

Morph-it! was built using:
- Repubblica, a large corpus (380 million tokens) of newspaper Italian
- a 25-million-token web corpus created using the BootCat tools (seed words were extracted from Repubblica)

Both corpora were annotated with lemma and POS tags using an earlier version of TreeTagger.

Methodology:
- Lemma extraction
- Generation of inflected forms
- Manual checking
- Manual compilation of exception lists

Different methods were used for the different categories we identified: verbs, adjectives, nouns, adverbs and function words.

- POS guesser: 157 classes of words were manually developed. Each class is defined by a regular expression. The probability of a tag t given an unknown word of class c is defined as the average probability of t among the known words of class c.
- DISTRIB: the model for this tagset uses an internal tagset created by amalgamating the DISTRIB and EAGLES tagsets. The training set was automatically generated from the EAGLES and DISTRIB versions. A simple Perl script then maps the output to the DISTRIB tagset.
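The POS guesser's estimate can be illustrated with a small sketch. It assumes that each known word comes with a tag probability distribution and that a word belongs to the first class whose regular expression matches it. The two classes, the toy lexicon, the tag names and the function names below are all hypothetical; the actual system used 157 hand-written classes that are not listed in these slides.

```python
import re
from collections import defaultdict

# Hypothetical word classes, each defined by a regular expression.
CLASSES = [
    ("ends-in-mente", re.compile(r".*mente$")),
    ("ends-in-are", re.compile(r".*are$")),
]

# Hypothetical lexicon: known word -> tag probability distribution.
LEXICON = {
    "velocemente": {"ADV": 1.0},
    "lentamente": {"ADV": 1.0},
    "mente": {"NOUN": 0.8, "VER": 0.2},
    "parlare": {"VER": 1.0},
    "mare": {"NOUN": 1.0},
}


def word_class(word):
    """Return the name of the first class whose pattern matches, else None."""
    for name, pattern in CLASSES:
        if pattern.match(word):
            return name
    return None


def class_tag_probs(lexicon):
    """For each class c, average p(t | w) over the known words w in c."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for word, probs in lexicon.items():
        c = word_class(word)
        if c is None:
            continue
        counts[c] += 1
        for tag, p in probs.items():
            sums[c][tag] += p
    return {
        c: {tag: total / counts[c] for tag, total in tags.items()}
        for c, tags in sums.items()
    }


def guess(word, table):
    """Tag distribution for an unknown word, looked up by its class."""
    return table.get(word_class(word), {})


TABLE = class_tag_probs(LEXICON)
unknown = guess("probabilmente", TABLE)
```

In this toy setup the unknown word "probabilmente" falls into the class "ends-in-mente", so it inherits the average of the three known words in that class, and the resulting probabilities still sum to one.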

- EAGLES: special tags were assigned internally to the adverb non and the subordinating conjunction che
- EAGLES: post-processing rules were developed by studying the errors made by TreeTagger (TT)

EAGLES: post-processing rules

- se + ne → personal pronoun
- se + verb → subordinating conjunction
- ci + essere → adverb
- l'/lo/la/li/le + verb → pronoun (and not an article)
- lo in lo stesso → article
- time expressions + fa → adverb
- upper case Di/D' + proper name → proper name
- upper case Chi → relative pronoun (sometimes wrong but more often right)
- senza + verb → conjunction
- ieri + preposition or altro → noun
- più + noun → adjective
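A few of the rules above could be sketched as a pass over the tagger output. This is an illustration only, not the authors' implementation: it assumes that the output is a list of (word, tag) pairs, that each rule retags the first word of the pattern, and that verb tags start with "VER". The tag names PRON_PERS, CONJ_SUB, PRON and ART are hypothetical stand-ins for the real EAGLES tags.

```python
# Hypothetical sketch of four of the post-processing rules.
CLITICS = {"l'", "lo", "la", "li", "le"}


def apply_rules(tagged):
    """Return a copy of the (word, tag) list with context rules applied.

    Each rule inspects a word and its right neighbour and, if the
    pattern matches, overrides the tag of the first word.
    """
    out = list(tagged)
    for i in range(len(out) - 1):
        word, tag = out[i]
        next_word, next_tag = out[i + 1]
        w, nw = word.lower(), next_word.lower()
        if w == "se" and nw == "ne":
            tag = "PRON_PERS"      # se + ne -> personal pronoun
        elif w == "se" and next_tag.startswith("VER"):
            tag = "CONJ_SUB"       # se + verb -> subordinating conjunction
        elif w == "lo" and nw == "stesso":
            tag = "ART"            # lo in "lo stesso" -> article
        elif w in CLITICS and next_tag.startswith("VER"):
            tag = "PRON"           # clitic + verb -> pronoun, not article
        out[i] = (word, tag)
    return out


fixed = apply_rules([("lo", "ART"), ("vedo", "VER"), (".", "PUNCT")])
```

Rules of this kind only override the tagger in narrow contexts, which matches the slide's remark that one of them is "sometimes wrong but more often right".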

Configuration                        EAGLES   DISTRIB
TT                                   96.27    96.18
TT + RegExpGuess                     97.08    96.86
TT + ExtLex                          97.61    97.25
TT + RegExpGuess + ExtLex            97.76    97.37
TT + RegExpGuess + ExtLex + Rules    97.89    NA

Table: Test set percentage accuracy of TreeTagger in various configurations (highest accuracies are those of TT + RegExpGuess + ExtLex + Rules submitted to EVALITA)
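One way to read the table is as a relative error reduction between the plain tagger and the best configuration for each tagset. The sketch below only does arithmetic on the figures in the table; the function name and the choice of this metric are assumptions, not something reported in the slides.

```python
def relative_error_reduction(base_acc, best_acc):
    """Fraction of the baseline errors removed by the better system.

    Both arguments are percentage accuracies, e.g. 96.27.
    """
    return (best_acc - base_acc) / (100.0 - base_acc)


# Baseline TT versus the best configuration available for each tagset.
eagles = relative_error_reduction(96.27, 97.89)   # + RegExpGuess + ExtLex + Rules
distrib = relative_error_reduction(96.18, 97.37)  # + RegExpGuess + ExtLex

print(f"EAGLES:  {eagles:.1%} of baseline errors removed")
print(f"DISTRIB: {distrib:.1%} of baseline errors removed")
```

With the table's numbers this comes to roughly 43% of the baseline errors removed for EAGLES and roughly 31% for DISTRIB.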

References

- H. Schmid. Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing, 1994.
- H. Schmid. Improvements in part-of-speech tagging with an application to German. Proceedings of the ACL SIGDAT-Workshop, 1995.
- E. Zanchetta and M. Baroni. Morph-it! A free corpus-based morphological resource for the Italian language. Proceedings of Corpus Linguistics 2005, 2006.

Links

- TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
- Morph-it!: http://sslmit.unibo.it/morphit