Automatic Categorization: Future Perspectives

Jacques Guyot (jacques@simple-shift.com / jacques@olanto.org)
WIPO, Geneva, February 2017

Services & Research

Simple-Shift
o A computer consulting company specializing in language engineering
o Installation, maintenance, and adaptation to the context of the organization
o Has been installing CAT tools for more than 16 years, mainly for international organizations

Olanto
o Olanto is a non-profit foundation (Free Software, AGPL)
o Competes with nobody, but can be useful to everyone; it is open to collaboration with translators, terminologists, computer scientists, researchers, integrators, and distributors

Software released or in development:
o mycat: concordancer and quote detector
o myprep: set of tools to prepare corpora (TMX, bitext, machine translation training)
o myprep & mymt: set of tools to prepare corpora, plus a statistical machine translation infrastructure
o myterm & How2Say: terminology manager based on TBX, plus a terminological explorer for multilingual corpora
o myclass: an automatic classifier for multilingual documents (https://www3.wipo.int/ipccat/)
o mysearch: a multilingual search tool (using translation for requests)
o Education: a translation environment for students

Presentation plan
o What was done at WIPO (since 2004)
o What can be done to improve IPCCAT
o Can IPCCAT be extended to other languages?

What is being done at WIPO

The IPCCAT user interface is available through the IPC publication platform (IPCPUB):
o Copy in the text to be classified
o Choose a classification level
o Receive 3 guesses
o Select one
o Start again at a deeper level

An example of use

Input text: A boundary control device, a boundary control system, and a method of conditioning the behavior of animals are provided... upon sensing of the object by the boundary sensor.

How it's done: train a neural network
1. Select the English and French patent documents that are already classified. Keep only certain fields (title, abstract, symbols, ...)
2. Validate the symbols to build the training corpus
3. Build a neural network for each node of the classification hierarchy
4. Volumes: 500 GB of source data, 22 million patents kept, 100 million symbols kept
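The idea of one classifier per node, combined with the three-guesses drill-down of the user interface, can be sketched as follows. This is only an illustration under stated assumptions: the tiny hierarchy, the example texts, and the word-overlap scorer are hypothetical stand-ins, not the IPCCAT networks or training data.

```python
from collections import Counter

# Hypothetical toy hierarchy: parent symbol -> {child symbol: example text}.
# A real system would train one neural network per node; here a simple
# word-overlap score stands in for the trained model.
TRAINING = {
    "ROOT": {"A": "animals food agriculture clothing", "H": "electricity radio signal"},
    "A": {"A01": "agriculture animals soil", "A23": "food preservation"},
    "A01": {"A01B": "soil working ploughs", "A01K": "animals husbandry boundary fence"},
}


def tokens(text):
    """Bag of words: word order is deliberately ignored."""
    return Counter(text.lower().split())


def score(text, example):
    """Stand-in scorer: number of word occurrences shared with the example."""
    return sum((tokens(text) & tokens(example)).values())


def top_k(node, text, k=3):
    """Rank the children of `node` and return the k best symbols."""
    children = TRAINING[node]
    ranked = sorted(children, key=lambda c: score(text, children[c]), reverse=True)
    return ranked[:k]


def drill_down(text, choose=lambda guesses: guesses[0]):
    """Walk the hierarchy: get guesses, pick one, restart at the deeper level."""
    node, path = "ROOT", []
    while node in TRAINING:
        node = choose(top_k(node, text))
        path.append(node)
    return path


print(drill_down("a boundary control device for animals"))
```

In the interactive setting, `choose` would be the human user selecting one of the three guesses; the default used here simply takes the first proposal.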

How it's done: published as a Web Service
1. Use the application through the WIPO interface with a browser
2. Use the Web Service from a specific application (developed externally)

What can be done to improve IPCCAT?
o Increase IPC coverage in the training corpus (more symbols, at deeper levels)
  - Currently: 7,007 symbols out of 72,981 in IPC 2017.01
o Increase IPCCAT accuracy
  - Currently: Top 3 at main group level is 80.5%
o Expand to other languages
  - Currently: English and French

Increase coverage (more symbols): add patents for uncovered symbols
o Improve the use of existing resources
  - Put all patents and symbols in a database
  - Extract the catalog with an intelligent strategy (CPC & IPC)
o Experimental result at main group level (2016.01): 467 missing symbols before, 310 in the improved version, i.e. about 33% progress
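The progress figure can be checked with a short calculation. The sketch below assumes that "progress" means the relative reduction in the number of missing main-group symbols; the function name is illustrative.

```python
def relative_reduction(before, after):
    """Fraction of the original missing symbols that are now covered."""
    return (before - after) / before


# Figures from the slide: missing main-group symbols before and after
missing_before, missing_after = 467, 310
progress = relative_reduction(missing_before, missing_after)
print(f"{progress:.1%}")  # prints 33.6%
```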

Increase coverage (more symbols): add new sources for uncovered symbols
o Reliable sources are not easy to find
o Some symbols are too new to have any patents yet
o Test with PatentScope

Examples of missing symbols:

Symbol      Documents in PatentScope   Since
A23L0009     0                         2016.01
A23L0015     0                         2016.01
A23L0017     0                         2016.01
A23L0025     0                         2016.01
A23L0035     0                         2016.01
A23P0020     0                         2016.01
A42C0099     2                         2006.01
A43D0057     2                         2006.01
A43D0097     3                         2006.01
A45D0097    16                         2011.01

Increase depth (group level)
o In 2013, we conducted an experiment at the group level
o Technically this is possible, despite a network of 60 billion neurons
o Coverage should be improved (see above)
o Accuracy must be increased by adding more examples for certain groups

Group statistics (2013.01)
Coverage: 60,042 of 70,870 (85%)

Top 3 average precision                        %
No intermediate step to group                  71
Intermediate step: from class to group         81
Intermediate step: from main group to group    85

Increase accuracy
o For all techniques: add patents for under-populated symbols (not enough examples for training)
o Explore other approaches:
  - Support Vector Machine (SVM)
    - Similar results, but very slow to train (100x)
  - Deep Learning
    - Very good when the representation is hidden (sentiment analysis)
    - But no real improvement for descriptive documents (without nuances) (https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-textclassification.pdf)
    - Needs specialized hardware
    - One to watch, to see what emerges from this new technique

Increase accuracy
In 2010, we participated in a challenge organized by CLEF (see http://ceur-ws.org/vol-1176/clef2010wn-clef-ip-piroiet2010.pdf)
- 2 million patent corpus
- Classification at main group level
- 12 participants
-> Our approach came out ahead of all the others
Why?
- No language processing
- Keep all information
-> Let the neural network do the job

Can IPCCAT be extended to other languages?
o The first version of IPCCAT had 4 languages: EN, FR, DE, RU
  - But, as seen above, it is difficult to maintain a training corpus with good coverage
  - We decided to maintain only English and French
o What to do for other languages?
  - Automated translators have improved
  - The classification is not sensitive to syntax errors
  - Only the correctness of the terminology matters
o We decided to experiment with machine translation
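A translate-then-classify pipeline of this kind could look like the sketch below. Everything here is hypothetical: `toy_translate` is a word-for-word glossary lookup and `toy_classify` a keyword rule, standing in for a real translation engine and the trained classifier. The point it illustrates is that a bag-of-words classifier can tolerate poor syntax as long as the terminology comes through.

```python
from typing import Callable, List


def classify_via_translation(
    text: str,
    translate: Callable[[str], str],
    classify: Callable[[str], List[str]],
    k: int = 3,
) -> List[str]:
    """Translate a foreign-language text to English, then return the top-k symbols."""
    english = translate(text)
    return classify(english)[:k]


# Hypothetical stand-in: word-for-word glossary lookup, so the output has
# correct terminology but no attempt at English syntax.
GLOSSARY = {"Pflug": "plough", "Boden": "soil", "Zaun": "fence", "Tiere": "animals"}


def toy_translate(text: str) -> str:
    return " ".join(GLOSSARY.get(word, word) for word in text.split())


# Hypothetical stand-in for the English classifier: it only looks at which
# terms are present, not at their order.
def toy_classify(english: str) -> List[str]:
    words = set(english.split())
    if words & {"plough", "soil"}:
        return ["A01B", "A01C", "A01K"]
    return ["A01K", "A01B", "A01C"]


print(classify_via_translation("Boden Pflug", toy_translate, toy_classify))
```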

Objectives of the experiment
o Compare several translation engines
o Choose "difficult" languages
o Assess accuracy:
  - In the context of interactive classification
  - In the context of reclassification
o Constraint: have enough patents to run the tests
o Translation engines: Google, Yandex, WIPO Translate, Bing (Microsoft)
o Languages: German, Russian, Chinese
o Main group for interactive classification: A01B 1
o Main groups for the reclassification simulation: A01B 1, A01B 3, A01B 49

Results for interactive classification (A01B 1)

Test documents:

Language   Nb patents   Source        Date     Mono class
RU         69           RUPAROM       2003     yes
DE         20           DEPAROM       2003     yes
ZH         20           PatentScope   recent   ?

Precision Top 3 in % (the symbol is among the first three proposals). The figure in parentheses is the English (EN) result for the same task.

          Class A01       SubClass A01B   MainGroup A01B 1   From class      From subclass
          (EN 87%)        (EN 75%)        (EN 84%)
Engine    RU   DE   ZH    RU   DE   ZH    RU   DE   ZH       RU   DE   ZH    RU   DE   ZH
bing      94  100   95    88  100   85    58  100   75       74   85   90    88  100   95
google    94  100  100    94   90   75    62   85   70       84   80   80   100  100  100
yandex    94   85   95    90   75   85    68   75   75       84   70   90    94   80   95
wipo      94   85   95    91   90   80    61   95   65       75   80   80    96   95  100
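The Top 3 precision used in this table counts a document as correct when its true symbol appears among the first three proposals. A minimal sketch of that metric, with made-up predictions rather than data from the experiment:

```python
def top_k_precision(predictions, truths, k=3):
    """Percentage of documents whose true symbol is among the first k proposals."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(predictions, truths))
    return 100.0 * hits / len(truths)


# Made-up ranked proposals for four documents whose true subclass is A01B
predictions = [
    ["A01B", "A01C", "A01D"],  # hit at rank 1
    ["A01K", "A01G", "A01B"],  # hit at rank 3
    ["A01C", "A01D", "A01G"],  # miss
    ["A01G", "A01B", "A01C"],  # hit at rank 2
]
truths = ["A01B"] * 4

print(top_k_precision(predictions, truths))       # 75.0
print(top_k_precision(predictions, truths, k=1))  # 25.0
```

With `k=1` the same function gives the "first proposal" precision used in the reclassification test.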

Results for interactive classification (A01B 1)
o Automatic translation is good enough to give respectable results (better than those obtained on the training languages)
o There are differences between the translation engines, but within this test they are not significant

Average of the 5 tasks (%):

Engine    RU   DE   ZH   RU+DE+ZH
bing      81   97   88   89
google    87   91   85   88
yandex    86   77   88   84
wipo      83   89   84   85
Average   84   89   86   86
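These averages can be recomputed from the per-task Top 3 figures of the previous slide. The sketch below assumes each engine's results are read as five tasks of (RU, DE, ZH) values, in the order the tasks are listed on that slide; the variable and function names are illustrative.

```python
from statistics import mean

LANGS = ("RU", "DE", "ZH")

# Top 3 precision (%) per engine: five tasks, each an (RU, DE, ZH) triple,
# copied from the per-task results slide.
RESULTS = {
    "bing":   [(94, 100, 95), (88, 100, 85), (58, 100, 75), (74, 85, 90), (88, 100, 95)],
    "google": [(94, 100, 100), (94, 90, 75), (62, 85, 70), (84, 80, 80), (100, 100, 100)],
    "yandex": [(94, 85, 95), (90, 75, 85), (68, 75, 75), (84, 70, 90), (94, 80, 95)],
    "wipo":   [(94, 85, 95), (91, 90, 80), (61, 95, 65), (75, 80, 80), (96, 95, 100)],
}


def averages(results):
    """Mean over the five tasks, per engine and per language."""
    return {
        engine: {lang: mean([task[i] for task in tasks]) for i, lang in enumerate(LANGS)}
        for engine, tasks in results.items()
    }


avg = averages(RESULTS)
for engine, by_lang in avg.items():
    print(engine, {lang: round(value, 1) for lang, value in by_lang.items()})
```

All but one of the recomputed values round to the figures shown on the slide. The exception is bing/RU, which comes out at 80.4 against the slide's 81, possibly because the slide averaged unrounded per-task values.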

Results for reclassification
o We simulate the partition of a class into three parts
  - T01B 0/, T01B 1/, T01B 3/, T01B 49/
o We train a neural network for this partition on English documents
o We use Yandex for the translation from Russian to English
o We use the first proposal for reclassification

Symbol     Nb samples   Precision (first proposal)
T01B 1     30           87%
T01B 3     30           83%
T01B 49    30           70%
Average                 80%

Translation can be an approach to reclassifying batches of documents in foreign languages.

Conclusion
o Neural networks are efficient and simple to implement
  - But we must keep watching new approaches
o Automatic translation is good enough for classification tasks and opens up access to automatic classification
  - But we still have to test other languages (Arabic, Spanish, Korean, ...)
o Emphasis should be placed on creating training corpora
  - Having sufficient examples for each symbol
  - Covering as much of the classification as possible
  - But the effort must stay proportionate to the outcome
o Automatic classification at group level is possible
  - But it must be introduced with caution

Thank you for your interest and attention