The Enriched TreeTagger System

H. Schmid, M. Baroni, E. Zanchetta, A. Stein
Universities of Stuttgart, Trento and Bologna (Forlì)
Evalita Workshop, Roma, September 10, 2007
TreeTagger

- Developed at the University of Stuttgart over a decade ago by Helmut Schmid
- Based on a Markov model
- Publicly available (http://www.ims.uni-stuttgart.de/projekte/corplex/treetagger)
- Supports multiple languages (Bulgarian, Dutch, English, (Old) French, German, Italian, Spanish, Russian)
- Rather fast (50,000 tokens/second on a typical laptop)
Morph-it!

- Developed 4 years ago at the University of Bologna (Forlì) by Marco Baroni and Eros Zanchetta
- Freely available (Creative Commons and LGPL) (http://sslmit.unibo.it/morphit)
- 500,000+ word forms and 30,000+ lemmas
- Human-readable list of word forms with their lemmas and features
Morph-it!

Form              Lemma             Features
rimpinzeremmo     rimpinzare        VER:cond+pre+1+p
abominevole       abominevole       ADJ:pos+m+s
dabbenaggine      dabbenaggine      NOUN-F:s
ostensibilmente   ostensibilmente   ADV
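The entry format shown above can be read with a few lines of code. The following is a minimal sketch, assuming each entry is a single line with three whitespace-separated fields and that features are joined by "+" after a ":" separator, as in the sample rows. The names Entry and parse_entry are hypothetical and are not part of the Morph-it! distribution.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    form: str
    lemma: str
    pos: str
    features: Tuple[str, ...]


def parse_entry(line: str) -> Entry:
    """Split one lexicon line into form, lemma, POS and feature tuple.

    Assumes three whitespace-separated fields (an assumption based on
    the sample rows on the slide).
    """
    form, lemma, tag = line.split()
    # "VER:cond+pre+1+p" -> pos "VER", features ("cond", "pre", "1", "p")
    pos, _, feats = tag.partition(":")
    return Entry(form, lemma, pos, tuple(feats.split("+")) if feats else ())


# The four sample rows from the slide.
SAMPLE = """\
rimpinzeremmo rimpinzare VER:cond+pre+1+p
abominevole abominevole ADJ:pos+m+s
dabbenaggine dabbenaggine NOUN-F:s
ostensibilmente ostensibilmente ADV
"""

entries = [parse_entry(line) for line in SAMPLE.splitlines() if line.strip()]
```

Entries without a feature part, such as the adverb row, come back with an empty feature tuple.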
Morph-it!

- Built using Repubblica, a large corpus (380 million tokens) of newspaper Italian, and a 25-million-token web corpus created with the BootCat tools (seed words were extracted from Repubblica)
- Both corpora were annotated with lemma and POS tags using an earlier version of TreeTagger
- Methodology:
  - Lemma extraction
  - Generation of inflected forms
  - Manual checking
  - Manual compilation of exception lists
- Different methods for the different categories we identified: verbs, adjectives, nouns, adverbs and function words
- POS Guesser: 157 classes of words were manually developed. Each class is defined by a regular expression. The probability of a tag t given an unknown word of class c is defined as the average probability of t among the known words of class c.
- DISTRIB: the model for this tagset uses an internal tagset created by amalgamating the DISTRIB and EAGLES tagsets. The training set was automatically generated from the EAGLES and DISTRIB versions. A simple Perl script then maps the output to the DISTRIB tagset.
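The guesser's probability estimate can be illustrated as follows. This is a minimal sketch, not the system's actual code: the two regular expressions, the toy lexicon and the tag names are hypothetical stand-ins for the 157 hand-written classes and the real lexicon, and a word is assumed to belong to the first class whose pattern matches.

```python
import re
from collections import defaultdict

# Hypothetical word classes; the real system defines 157 of them.
WORD_CLASSES = [
    ("ends_in_mente", re.compile(r".+mente$")),
    ("ends_in_zione", re.compile(r".+zione$")),
]


def word_class(word):
    """Return the name of the first class whose regex matches, else None."""
    for name, pattern in WORD_CLASSES:
        if pattern.match(word):
            return name
    return None


def class_tag_probs(lexicon):
    """Average P(tag | word) over the known words of each class.

    lexicon maps each known word to a dict {tag: P(tag | word)}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for word, tag_probs in lexicon.items():
        c = word_class(word)
        if c is None:
            continue
        counts[c] += 1
        for tag, p in tag_probs.items():
            sums[c][tag] += p
    return {c: {t: s / counts[c] for t, s in tags.items()}
            for c, tags in sums.items()}


def guess(word, class_probs):
    """Tag distribution for an unknown word; empty if no class matches."""
    return class_probs.get(word_class(word), {})


# Toy lexicon with made-up probabilities.
LEXICON = {
    "lentamente": {"ADV": 1.0},
    "veramente": {"ADV": 1.0},
    "demente": {"ADJ": 0.5, "NOUN": 0.5},
    "nazione": {"NOUN": 1.0},
}

probs = class_tag_probs(LEXICON)
unknown = guess("ostensibilmente", probs)
```

With this toy lexicon, an unknown word ending in "mente" receives the average of the three known "mente" words, so ADV gets two thirds of the probability mass and ADJ and NOUN share the rest.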
- EAGLES: the adverb non and the subordinating conjunction che were internally assigned special tags
- EAGLES: post-processing rules were developed by studying the errors made by TreeTagger (TT)
Post-processing rules

- se + ne → personal pronoun
- se + verb → subordinating conjunction
- ci + essere → adverb
- l/lo/la/li/le + verb → pronoun (and not an article)
- lo in lo stesso → article
- time expressions + fa → adverb
- upper-case Di/D + proper name → proper name
- upper-case Chi → relative pronoun (sometimes wrong but more often right)
- senza + verb → conjunction
- ieri + preposition or altro → noun
- più + noun → adjective
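Rules of this kind can be applied as a pass over the tagger output. The sketch below covers only four of the rules listed above and rests on several assumptions that the slides do not confirm: the tag names (PRON, CONJ, ART, VERB) are placeholders rather than EAGLES tags, the first word of each pattern is the one that gets retagged, a verb is recognised by its tag, and matching is case-insensitive. The function name apply_rules is hypothetical.

```python
# Forms covered by the "lo/la/li/le + verb" rule in this sketch.
CLITICS = {"lo", "la", "li", "le"}


def apply_rules(tagged):
    """Return a copy of [(word, tag), ...] with a few contextual fixes.

    Placeholder tags: PRON, CONJ, ART, VERB (not the EAGLES tagset).
    """
    out = list(tagged)
    for i in range(len(out) - 1):
        word, _ = out[i]
        next_word, next_tag = out[i + 1]
        w, nw = word.lower(), next_word.lower()
        if w == "se" and nw == "ne":
            out[i] = (word, "PRON")   # se + ne -> personal pronoun
        elif w == "se" and next_tag == "VERB":
            out[i] = (word, "CONJ")   # se + verb -> subordinating conjunction
        elif w == "lo" and nw == "stesso":
            out[i] = (word, "ART")    # lo in "lo stesso" -> article
        elif w in CLITICS and next_tag == "VERB":
            out[i] = (word, "PRON")   # lo/la/li/le + verb -> pronoun
    return out


fixed = apply_rules([("la", "ART"), ("vede", "VERB")])
```

Each rule only rewrites the tag at the current position, so the tag of the following word is always the one the tagger originally assigned.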
Configuration                       EAGLES   DISTRIB
TT                                  96.27    96.18
TT + RegExpGuess                    97.08    96.86
TT + ExtLex                         97.61    97.25
TT + RegExpGuess + ExtLex           97.76    97.37
TT + RegExpGuess + ExtLex + Rules   97.89    NA

Table: Test set accuracy (%) of TreeTagger in various configurations (the highest accuracies are those of the TT + RegExpGuess + ExtLex + Rules system submitted to EVALITA)
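One way to read the table is in terms of relative error reduction rather than raw accuracy. The following sketch only does arithmetic on the accuracies reported above; the function name error_reduction is hypothetical, and the measure itself is an illustration added here, not a figure reported on the slides.

```python
# Accuracies (%) copied from the table: (EAGLES, DISTRIB); None means NA.
ACCURACY = {
    "TT": (96.27, 96.18),
    "TT + RegExpGuess": (97.08, 96.86),
    "TT + ExtLex": (97.61, 97.25),
    "TT + RegExpGuess + ExtLex": (97.76, 97.37),
    "TT + RegExpGuess + ExtLex + Rules": (97.89, None),
}


def error_reduction(base_acc, new_acc):
    """Fraction of the baseline errors removed by the improved system."""
    base_err = 100.0 - base_acc
    new_err = 100.0 - new_acc
    return (base_err - new_err) / base_err


eagles = error_reduction(ACCURACY["TT"][0],
                         ACCURACY["TT + RegExpGuess + ExtLex + Rules"][0])
distrib = error_reduction(ACCURACY["TT"][1],
                          ACCURACY["TT + RegExpGuess + ExtLex"][1])
```

On these numbers the best EAGLES configuration removes roughly 43% of the plain TreeTagger errors (3.73% down to 2.11%), and the best DISTRIB configuration roughly 31% (3.82% down to 2.63%).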
- H. Schmid. Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing, 1994.
- H. Schmid. Improvements in part-of-speech tagging with an application to German. Proceedings of the ACL SIGDAT-Workshop, 1995.
- E. Zanchetta and M. Baroni. Morph-it! A free corpus-based morphological resource for the Italian language. Proceedings of Corpus Linguistics 2005, 2006.
Links

- TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
- Morph-it!: http://sslmit.unibo.it/morphit