Document Comparison with a Weighted Topic Hierarchy

A. Gelbukh, G. Sidorov, and A. Guzmán-Arenas
Natural Language Laboratory, Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City
{gelbukh, sidorov, aguzman}@pollux.cic.ipn.mx

Abstract

A method of document comparison based on a hierarchical dictionary of topics (concepts) is described. The hierarchical links in the dictionary are supplied with weights that are used for detecting the main topics of a document and for determining the similarity between two documents. The method allows for the comparison of documents that do not share any words literally but do share concepts, including comparison of documents in different languages. The method also allows for comparison with respect to a specific aspect, i.e., a specific topic of interest (with its respective subtopics). A system, Classifier, that uses the discussed method for document classification and information retrieval is discussed.

1. Introduction*

In this article, a document comparison method based on document classification is discussed. The task of document classification can be examined from different points of view [1], [6], [8], [10]. We consider it as the assignment of one or several topics to the document. For example, some documents are about health, and some are about politics. Accordingly, we consider document comparison with respect to such a classification: two documents are similar if they share their principal topics.

In some existing systems, such as [9], [12], the contents of the document are characterized by the words frequently used in the document, with no external dictionaries being used. In our work, the documents are related to the entries of a pre-determined dictionary of concepts organized in a hierarchical structure. The dictionary, though, is large, so that statistical methods can be applied to its entries.

* The work was done under partial support of DEPI-IPN, CONACyT grant 26424-A, REDII-CONACyT, and COFAA-IPN, Mexico.

In our approach to document classification and comparison, a document is associated with many topics rather than with only one principal topic.
More precisely, a document is characterized by a vector of topic weights representing a measure of correspondence of the document to each of the available topics. This still allows for a more traditional view on classification: the topic(s) with the best value of this measure is the principal topic(s) of the document. Figure 1 shows a screen shot of our program, Classifier, with a histogram of the weights $r_i$ of the topics for a Spanish document.

Concept hierarchies have been extensively used in information retrieval and recently in text mining [5], [11]. In [3], [4] it was proposed to use a hierarchical dictionary for determining the main themes of a document. In this paper, we discuss the use of the weights $r_i$ for document comparison.

First, the dictionary structure is presented. Then, the algorithm for calculation of the topic weights $r_i$ is described; we also touch upon the issue of calculation of the link weights in the dictionary. Finally, the algorithm of document comparison is discussed.

2. Weighted topic hierarchy

The dictionary consists of two major parts: the vocabulary and the hierarchical structure. The vocabulary includes syntagmatic units, i.e., individual words like Italy or word combinations like the United States of America; we will call any such unit a keyword.

The hierarchical structure represents semantic units, i.e., concepts, or topics. It is a tree or, more generally, a directed acyclic graph, which represents the concepts by grouping together the words or other concepts. For example, a concept Europe includes, among others, the word Europe and the concepts Western Europe, Eastern Europe, Schengen states, etc.

Figure 1. Topic histogram for a document in Spanish.

Figure 2 shows an example of a dictionary entry. The hierarchy of non-terminal concepts is shown on the left side of the picture; the keywords immediately belonging to the selected topic are shown on the right side.

Since our approach is language independent, words of different languages can be mixed in the dictionary. Consequently, depending on the settings chosen by the user, the system can either autodetect the document language and use only the words of the detected language, or use the words of all available languages. As the bottom right part of Figure 2 shows, so far our dictionary is implemented in English, French, and Spanish.

The links in the hierarchy have different strengths, expressed by the weights of the links. These weights roughly correspond to the probability that the word in a particular context is really related to the given topic. For example, the word Italy or the concept (group) Schengen states in practically any context belongs to the topic Europe; thus, the weight of this link is 1. On the other hand, the word London can refer to a city in England or, with much less probability, in Canada; consequently, the weight of the link between London and England is, say, 0.9. The link between English and England is very weak because the English language is frequently used with no relation to England.

Assigning the weights to the links is not a trivial task, but we cannot go into details here. In short, the weight $w_{ij}$ of the link between a node $i$ and its parent node $j$ characterizes the mean relevance of the documents containing this word for the given topic. For terminal nodes, a simplified way of automatically assigning the weights of their links to their parent concepts consists in adopting the inverse proportion to the frequency of the word:

$$w_{ij} = \frac{1}{\sum_{k \in D} n_{ik}}$$

independently of the parent topic $j$. Here $n_{ik}$ is the number of occurrences of the terminal node $i$ in the document $k$, and the summation is performed over the documents of a training corpus $D$.
For example, the articles a and the have a (nearly) zero weight for any topic, while the word carburetor has a high weight in any topic in which it is included.

As to the links between non-terminal concepts, we will not discuss here the issue of assignment of their weights. Since for a shallow hierarchy the number of such links is not very large, the weights can be assigned manually or just considered to be all equal to 1.
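The inverse-frequency rule for terminal link weights can be illustrated with a minimal Python sketch. The function name `terminal_link_weights`, the toy corpus, whitespace tokenization, and the choice of weight 0 for a keyword absent from the training corpus are assumptions made for illustration only; multi-word keywords are not handled here.

```python
from collections import Counter


def terminal_link_weights(corpus, keywords):
    """Return w[keyword] = 1 / (total occurrences of the keyword in the corpus).

    corpus   -- list of training documents (plain strings)
    keywords -- single-word terminal nodes of the hierarchy
    A keyword never seen in the corpus gets weight 0.0 (an assumption,
    the text does not say how such keywords are treated).
    """
    totals = Counter()
    for document in corpus:
        counts = Counter(document.lower().split())  # naive tokenization
        for keyword in keywords:
            totals[keyword] += counts[keyword]
    return {
        keyword: (1.0 / totals[keyword] if totals[keyword] else 0.0)
        for keyword in keywords
    }


# Hypothetical three-document training corpus.
corpus = [
    "the carburetor of the car",
    "the engine and the wheel",
    "a car with a new carburetor",
]
weights = terminal_link_weights(corpus, ["the", "carburetor", "car"])
```

On this tiny corpus the frequent word "the" receives a lower weight than "carburetor"; on a realistic corpus the gap would be far larger, which is the effect described in the text.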

Figure 2. Hierarchical dictionary used by the system.

3. Calculation of topic weights

The algorithm of application of the dictionary for detection of the topic weights is, in the first approximation, very simple. To determine the weights $r_i$ of the topics (nodes) for a given document, the following two passes are performed:

1. For each terminal node of the hierarchy, i.e., each keyword (a single word or a word combination), its frequency in the text is determined; the topic weight $r_i$ for this node is equal to its frequency.

2. The frequencies are propagated recursively along the links in the hierarchy:

$$r_j = \sum_i w_{ij} \, r_i.$$

Here the summation is performed over the nodes $i$ dependent on the given node $j$; $w_{ij}$ is the weight of the link between the current node and the dependent node.

Note that such an algorithm leads to very high weights of the top nodes of the hierarchy: all the documents prove to have objects and actions as their principal topics. Handling this effect in the applications in which it presents a problem goes beyond the scope of this article. This effect, however, does not present any problem in a shallow or one-level hierarchy.

The set of topics can be restricted by the user; such a restriction is a part of the user's query. In the simplest case, the search query consists in selecting a subtree of the topic hierarchy by selecting a desirable top node. Only the topics below this node will participate in the calculations.

One more screen shot of the Classifier program is shown in Figure 3. The words and topics found in the selected document (with the Spanish title Hombres de lucha, mejor) for the selected topic Institutions are presented. The words are shown with their frequencies in the document, and the non-terminal topics with their calculated weights for this document.

4. Document comparison

Thus, we define the document image as a vector of topic weights ($r_i$). This vector includes all nodes of the hierarchy. As a variant of our approach, this vector can include only non-terminal nodes, i.e., groups of keywords; this greatly decreases memory requirements and increases the efficiency of the algorithm.
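The two-pass weight calculation of Section 3, which produces the document image, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `topic_weights`, the toy hierarchy, the link weights 1.0 and 0.1 (only the value 0.9 for London and England comes from the text), and whitespace tokenization are all hypothetical. Memoization is used so that a node shared by several parents in a directed acyclic graph is evaluated once.

```python
from collections import Counter


def topic_weights(tokens, children, terminals):
    """Compute the weight r of every node for one document.

    tokens    -- list of words of the document
    children  -- dict: parent node -> list of (dependent node, link weight)
    terminals -- set of keywords (terminal nodes)
    """
    freq = Counter(tokens)  # pass 1: keyword frequencies
    memo = {}

    def weight(node):
        # Pass 2: r_parent = sum over dependent nodes of w * r_child.
        if node not in memo:
            if node in terminals:
                memo[node] = float(freq[node])
            else:
                memo[node] = sum(
                    w * weight(child) for child, w in children.get(node, [])
                )
        return memo[node]

    nodes = set(terminals) | set(children)
    return {node: weight(node) for node in nodes}


# Hypothetical two-level hierarchy; concept names are capitalized,
# keywords are lower-case, so that the two kinds of nodes do not clash.
children = {
    "Europe": [("italy", 1.0), ("England", 1.0)],
    "England": [("london", 0.9), ("english", 0.1)],
}
terminals = {"italy", "london", "english"}
tokens = "london is far from italy but london speaks english".split()
r = topic_weights(tokens, children, terminals)
```

Here `r` plays the role of the document image: it maps every node, terminal or not, to its weight. Dropping the terminal entries from `r` would give the more compact variant that keeps only the groups of keywords.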
For the purposes of comparison, in most cases, the user is not interested in the absolute amount of information conveyed by a document, i.e., the total number of words in the document that are related to a specific topic. Instead, the user is interested in the theme to which the document is devoted, i.e., the density of the specific keywords. In this case we normalize the document image by dividing each coordinate $r_i$ by the total number of words in the document. Thus, the size of the documents does not affect the results of comparison.

Figure 3. Counting keywords for a Spanish document.

The distance between two documents $D_1$ and $D_2$ can now be defined in terms of these vectors. There are several possible ways to determine the distance between two vectors, depending on the needs of the user. The simplest way is a weighted combination of absolute differences of coordinates:

$$|D_1 - D_2| = \sum_i \alpha_i \, |r_{1,i} - r_{2,i}|,$$

where the summation is performed over all nodes $i$ of the hierarchy and $r_{1,i}$, $r_{2,i}$ are the normalized weights of the node $i$ in the two documents. Here $\alpha_i$ are the importance coefficients of the nodes of the hierarchy. In the ideal case they would reflect the user's profile: the user assigns greater coefficients to the most important topics. However, in practice most of them have to be predefined. In our system, they are assigned according to the following rules: the coefficients of individual keywords are much less than those of any group (non-terminal node), or even are zeroes, as was discussed above; the coefficients of the lowest-level non-terminal nodes are maximal; and the coefficients of the higher-level nodes decrease as the level rises.

Effectively, the comparison is done by the low-level groups of keywords. On the one hand, this makes it possible for two documents to be very similar even if they do not have any common words literally but do share a common topic. On the other hand, the documents that do share keywords are still slightly closer than those that only share topics.

An interesting application of the method is classification of the documents by similarity with respect to a given topic. Clearly, a document mentioning the use of animals for military purposes and a document mentioning the feeding of animals are similar (both mention animals) from the point of view of a biologist, while from the point of view of a military man they are very different.
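The normalized document image and the weighted distance can be sketched in Python as follows. The names `normalize` and `distance`, the optional `aspect` argument (modelling the restriction of the comparison to a user-selected set of topics), the toy weights, and the coefficient values are assumptions made for illustration; the coefficients only mimic the rule that keywords count much less than groups.

```python
def normalize(image, total_words):
    """Divide each topic weight by the number of words in the document."""
    if total_words == 0:
        return dict(image)
    return {node: r / total_words for node, r in image.items()}


def distance(image1, image2, alpha, aspect=None):
    """Weighted sum of absolute differences of the topic weights.

    alpha  -- dict: node -> importance coefficient (missing nodes count as 0)
    aspect -- optional set of nodes; if given, only these nodes are compared
    """
    nodes = set(image1) | set(image2)
    if aspect is not None:
        nodes &= set(aspect)
    return sum(
        alpha.get(n, 0.0) * abs(image1.get(n, 0.0) - image2.get(n, 0.0))
        for n in nodes
    )


# Hypothetical images of a 20-word and a 10-word document.
doc1 = normalize({"Animals": 4.0, "Military": 2.0, "horse": 4.0}, 20)
doc2 = normalize({"Animals": 2.0, "Feeding": 1.0, "horse": 1.0}, 10)
alpha = {"Animals": 1.0, "Military": 1.0, "Feeding": 1.0, "horse": 0.1}

d_all = distance(doc1, doc2, alpha)
d_biologist = distance(doc1, doc2, alpha, aspect={"Animals"})
d_military = distance(doc1, doc2, alpha, aspect={"Military"})
```

With these toy numbers the two documents coincide when only the topic Animals is compared and differ when only the topic Military is compared, which mirrors the biologist and military viewpoints of the example.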
Such aspect-dependent comparison is handled by selecting the aspect of comparison, i.e., a subtree of the topic hierarchy, so that the document images contain only the selected topics.

5. Conclusions and future work

We have discussed a method of document comparison based on the use of a weighted hierarchy of topics (concepts). The method has the following advantages:

1. The documents that do not share any words literally can still be identified as similar if they share common topics.

2. The comparison can be done taking into account the user profile, or the aspect, i.e., a subset (subtree) of topics that are of interest to the user.

The need for a large dictionary is a disadvantage of the method. However, the method has proved to be insensitive to a rather low quality of the dictionary. For example, in our experiments we used a French dictionary that was an automatic translation of the English one. We applied our algorithm to a set of English documents and the corresponding set of their manual French translations; the difference in the results was insignificant. Also, the documents representing the same text in different languages were reported by the algorithm as very similar.

Though generally the results obtained in our experiments showed good accordance with the opinion of human experts, we have encountered some problems with using our method. Most of these problems are related to lexical ambiguity of different types, such as well (noun versus adverb) or bill (five different meanings as a noun) [7]. In the future, we plan to apply a part-of-speech tagger to resolve the ambiguity of the first type, and to implement an algorithm making use of different senses like bill 1, bill 2 manually marked up in the dictionary; such an algorithm can be thesaurus-based [2] or statistical.

Another direction of improvement of the algorithm is taking into account the anaphoric relationships in the text. For example, the pronouns and zero subjects (in Spanish) could be replaced with the corresponding nouns.

References

[1] Cohen, W., Singer, Y.: Context-sensitive Learning Methods for Text Categorization. In: SIGIR'96 (1996)
[2] Gelbukh, A.: Using a Semantic Network for Lexical and Syntactic Disambiguation. In: Proceedings of Simposium Internacional de Computación: Nuevas Aplicaciones e Innovaciones Tecnológicas en Computación. Mexico (1997)
[3] Guzmán-Arenas, A.: Hallando los Temas Principales en un Artículo en Español. Soluciones Avanzadas 5 (45) (1997) 58, 5 (49) (1997) 66
[4] Guzmán-Arenas, A.: Finding the Main Themes in a Spanish Document. Journal Expert Systems with Applications 14 (1, 2) (1998) 139-148
[5] Feldman, R., Dagan, I.: Knowledge Discovery in Textual Databases (KDT). In: Proc. of Intern. Symposium KDD-95, Montreal, Canada (1995) 112-117
[6] Jacob, E. K.: Cognition and Classification: A Crossdisciplinary Approach to a Philosophy of Classification. (Abstract.) In: Maxian, B. (ed.): ASIS '94: Proceedings of the 57th ASIS Annual Meeting. Medford, NJ: Learned Information (1994) 82
[7] Krowetz, B.: Homonymy and Polysemy in Information Retrieval. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (1997) 72-79
[8] Lewis, D. D., Ringuette, M.: A Comparison of Two Learning Algorithms for Text Categorization. In: Third Annual Symposium on Document Analysis and Information Retrieval (1994) 81-93
[9] Meunier, Jean Guy, Lakhdar Remaki, and Dominic Forest: Use of classifiers in computer-assisted reading and analysis of texts (CARAT). To be published. (1999)
[10] Riloff, E., Shepherd, J.: A Corpus Based Approach for Building Semantic Lexicons. In: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2) (1997)
[11] Srikant, R., Agrawal, R.: Mining Sequential Patterns: Generalizations and Performance Improvements. In: Proc. of the 5th International Conference on Extending Database Technology (EDBT), Avignon, France, March (1996)
[12] TextAnalyst system. http://www.analyst.ru (in Russian), http://www.megaputer.com (1998)