Document Comparison with a Weighted Topic Hierarchy

A. Gelbukh, G. Sidorov, and A. Guzmán-Arenas
Natural Language Laboratory, Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City
{gelbukh, sidorov, aguzman}@pollux.cic.ipn.mx

Abstract

A method of document comparison based on a hierarchical dictionary of topics (concepts) is described. The hierarchical links in the dictionary are supplied with weights that are used for detecting the main topics of a document and for determining the similarity between two documents. The method allows for the comparison of documents that do not share any words literally but do share concepts, including the comparison of documents in different languages. The method also allows for comparison with respect to a specific aspect, i.e., a specific topic of interest (with its respective subtopics). A system, Classifier, that uses the discussed method for document classification and information retrieval is discussed.

1. Introduction*

In this article, a document comparison method based on document classification is discussed. The task of document classification can be examined from different points of view [1], [6], [8], [10]. We consider it as the assignment of one or several topics to the document. For example, some documents are about health, and some about politics. Accordingly, we consider document comparison with respect to such a classification: two documents are similar if they share their principal topics.

In some existing systems, such as [9], [12], the contents of the document are characterized by the words frequently used in the document, with no external dictionaries being used. In our work, the documents are related to the entries of a pre-determined dictionary of concepts organized in a hierarchical structure. The dictionary, though, is large, so that statistical methods can be applied to its entries.

* The work was done under partial support of DEPI-IPN, CONACyT grant 26424-A, REDII-CONACyT, and COFAA-IPN, Mexico.

In our approach to document classification and comparison, a document is associated with many topics rather than with only one principal topic.
More precisely, a document is characterized by a vector of topic weights representing a measure of correspondence of the document to each of the available topics. This still allows for a more traditional view of classification: the topic or topics with the best value of this measure are the principal topics of the document. Figure 1 shows a screen shot of our program, Classifier, with a histogram of the topic weights $r_i$ for a Spanish document.

Concept hierarchies have been extensively used in information retrieval and recently in text mining [5], [11]. In [3], [4] it was proposed to use a hierarchical dictionary for determining the main themes of a document. In this paper, we discuss the use of the weights $r_i$ for document comparison. First, the dictionary structure is presented. Then, the algorithm for calculation of the topic weights $r_i$ is described; we also touch upon the issue of calculation of the link weights in the dictionary. Finally, the algorithm of document comparison is discussed.

2. Weighted topic hierarchy

The dictionary consists of two major parts: the vocabulary and the hierarchical structure. The vocabulary includes syntagmatic units, i.e., individual words like Italy or word combinations like the United States of America; we will call any such unit a keyword. The hierarchical structure represents semantic units, i.e., concepts, or topics. It is a tree or, more generally, a directed acyclic graph, which represents the concepts by grouping together words or other concepts. For example, the concept Europe includes, among others, the word Europe and the concepts Western Europe, Eastern Europe, Schengen states, etc.

Figure 1. Topic histogram for a document in Spanish.

Figure 2 shows an example of a dictionary entry. The hierarchy of non-terminal concepts is shown on the left side of the picture; the keywords immediately belonging to the selected topic are shown on the right side. Since our approach is language independent, words of different languages can be mixed in the dictionary. Consequently, depending on the settings chosen by the user, the system can either autodetect the document language and use only the words of the detected language, or use the words of all available languages. As the bottom right part of Figure 2 shows, so far our dictionary is implemented in English, French, and Spanish.

The links in the hierarchy have different strengths, expressed by the weights of the links. These weights roughly correspond to the probability that the word in a particular context is really related to the given topic. For example, the word Italy or the concept (group) Schengen states belongs to the topic Europe in practically any context; thus, the weight of this link is 1. On the other hand, the word London can refer to a city in England or, with much lower probability, in Canada; consequently, the weight of the link between London and England is, say, 0.9. The link between English and England is very weak because the English language is frequently used with no relation to England.

Assigning the weights to the links is not a trivial task, but here we cannot go deep into details. In short, the weight $w_{ij}$ of the link between a node $j$ and its parent node $i$ characterizes the mean relevance, for the given topic, of the documents containing this word. For terminal nodes, a simplified way of automatically assigning the weights of their links to their parent concepts consists in adopting the inverse proportion to the frequency of the word:

$$w_{ij} = \frac{1}{\sum_{k \in D} n_{jk}},$$

independently of the parent topic $i$. Here $n_{jk}$ is the number of occurrences of the terminal node $j$ in the document $k$, and the summation is over the documents of a training corpus $D$.
For example, the articles a and the have a (nearly) zero weight for any topic, while the word carburetor has a high weight in any topic in which it is included.

As to the links between non-terminal concepts, we will not discuss the issue of assigning their weights here. Since for a shallow hierarchy the number of such links is not very large, the weights can be assigned manually or just considered to be all equal to 1.
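The simplified inverse-frequency assignment for terminal nodes can be illustrated with a minimal sketch. The function name `terminal_link_weights`, the whitespace tokenization, the restriction to single-word keywords, and the toy corpus are all assumptions made for illustration; they are not taken from the Classifier system.

```python
from collections import Counter

def terminal_link_weights(corpus, keywords):
    """Assign each keyword the weight 1 / (total occurrences in the corpus).

    Assumptions of this sketch: documents are plain strings, tokens are
    obtained by lowercasing and splitting on whitespace, and every keyword
    is a single word.
    """
    totals = Counter()
    for document in corpus:
        counts = Counter(document.lower().split())
        for keyword in keywords:
            totals[keyword] += counts[keyword]
    # Keywords never seen in the training corpus receive no weight here.
    return {k: 1.0 / totals[k] for k in keywords if totals[k] > 0}

# Hypothetical two-document training corpus.
corpus = [
    "the carburetor of the car",
    "the cat and the dog",
]
weights = terminal_link_weights(corpus, ["the", "carburetor", "piston"])
print(weights)
```

In this toy corpus the frequent word "the" receives a lower weight than the rare word "carburetor", which matches the behavior described in the text; how unseen keywords should be treated is left open by the paper, and dropping them is only one possible choice.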

Figure 2. Hierarchical dictionary used by the system.

3. Calculation of topic weights

The algorithm for applying the dictionary to detect the topic weights is, in the first approximation, very simple. To determine the weights $r_i$ of the topics (nodes) for a given document, the following two passes are performed:

1. For each terminal node of the hierarchy, i.e., each keyword (a single word or a word combination), its frequency in the text is determined; the topic weight $r_i$ for this node is equal to its frequency.
2. The frequencies are propagated recursively along the links in the hierarchy: $r_i = \sum_j w_{ij} r_j$. Here the summation is over the dependent nodes $j$ of the given node $i$; $w_{ij}$ is the weight of the link between the current node and the dependent node.

Note that such an algorithm leads to very high weights of the top nodes of the hierarchy: all the documents prove to have objects and actions as their principal topics. Handling this effect in the applications in which it presents a problem goes beyond the scope of this article. This effect, however, does not present any problem in a shallow or one-level hierarchy.

The set of topics can be restricted by the user; such a restriction is a part of the user's query. In the simplest case, the search query consists in selecting a subtree of the topic hierarchy by selecting a desirable top node. Only the topics below this node will participate in the calculations.

One more screen shot of the Classifier program is shown in Figure 3. The words and topics found in the selected document (with the Spanish title Hombres de lucha, mejor) for the selected topic Institutions are presented. The words are shown with their frequencies in the document, and the non-terminal topics with their calculated weights for this document.

4. Document comparison

Thus, we define the document image as a vector of topic weights ($r_i$). This vector includes all nodes of the hierarchy. As a variant of our approach, this vector can include only non-terminal nodes, i.e., groups of keywords; this greatly decreases memory requirements and increases the efficiency of the algorithm.
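The two passes of Section 3 and the document image restricted to non-terminal nodes can be sketched as follows. The dictionary representation (`CHILDREN` maps a topic to its dependent nodes with link weights), the function names, the single-word keywords, and the weights 1.0 and 0.1 are hypothetical choices for this illustration; only the weight 0.9 for London and England comes from the text.

```python
from collections import Counter

# Hypothetical toy hierarchy: topic -> list of (dependent node, link weight).
# Keywords are lowercase strings that never appear as keys.
CHILDREN = {
    "Europe": [("italy", 1.0), ("England", 1.0)],
    "England": [("london", 0.9), ("english", 0.1)],
}

def topic_weights(tokens, children):
    """Return the weight r of every node reachable in the hierarchy."""
    counts = Counter(tokens)  # pass 1: keyword frequencies in the text
    r = {}

    def weight(node):
        if node not in r:  # memoize: in a DAG a node may have several parents
            if node in children:
                # non-terminal node: weighted sum over its dependent nodes
                r[node] = sum(w * weight(child) for child, w in children[node])
            else:
                # terminal node: its frequency in the text
                r[node] = float(counts[node])
        return r[node]

    for node in children:  # pass 2: propagate along the links
        weight(node)
    return r

def document_image(r, children):
    """Variant of the image that keeps only non-terminal nodes."""
    return {node: r[node] for node in children}

tokens = "london is far from italy but london is close to paris".split()
r = topic_weights(tokens, CHILDREN)
image = document_image(r, CHILDREN)
print(image)
```

Memoizing the recursion keeps each node evaluated once, which matters when the hierarchy is a directed acyclic graph rather than a tree. As the text notes, the top node accumulates the largest weight in this scheme.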
For the purposes of comparison, in most cases the user is not interested in the absolute amount of information conveyed by a document, i.e., the total number of words in the document that are related to a specific topic. Instead, the user is interested in the theme to which the document is devoted, i.e., the density of the specific keywords. In this case we normalize the document image by dividing each coordinate $r_i$ by the total number of words in the document. Thus, the size of the documents does not affect the results of comparison.

Figure 3. Counting keywords for a Spanish document.

The distance between two documents $D_1$ and $D_2$ can now be defined in terms of these vectors. There are several possible ways to determine the distance between two vectors, depending on the needs of the user. The simplest way is a weighted combination of absolute differences of coordinates:

$$d(D_1, D_2) = \sum_i \alpha_i \, |r_{1i} - r_{2i}|,$$

where the summation is over all nodes $i$ of the hierarchy, and $r_{1i}$ and $r_{2i}$ are the coordinates of the images of $D_1$ and $D_2$. Here $\alpha_i$ are the importance coefficients of the nodes of the hierarchy. In the ideal case they would reflect the user's profile: the user assigns greater coefficients to the most important topics. However, in practice most of them have to be predefined. In our system, they are assigned according to the following rules: the coefficients of individual keywords are much smaller than those of any group (non-terminal node), or are even zero, as was discussed above; the coefficients of the lowest-level non-terminal nodes are maximal; and the coefficients of the higher-level nodes are the smaller, the higher the level.

Effectively, the comparison is done by the low-level groups of keywords. On the one hand, this makes it possible for two documents to be very similar even if they do not have any words literally in common but do share a common topic. On the other hand, documents that do share keywords are still slightly closer than those that only share topics.

An interesting application of the method is classification of the documents by similarity with respect to a given topic. Clearly, a document mentioning the use of animals for military purposes and a document mentioning the feeding of animals are similar (both mention animals) from the point of view of a biologist, while from the point of view of a military man they are very different.
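The normalization and the weighted distance described in this section can be sketched as below. The function names, the treatment of nodes missing from an image or from the coefficient table as zero, and the sample images and coefficients are assumptions of this sketch, not details given in the paper.

```python
def normalize(image, total_words):
    """Divide every coordinate by the number of words in the document."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return {node: r / total_words for node, r in image.items()}

def distance(image1, image2, alpha):
    """Weighted sum of absolute coordinate differences.

    Assumption: a node absent from an image has weight 0, and a node
    absent from alpha has importance 0 (it is ignored).
    """
    nodes = set(image1) | set(image2)
    return sum(
        alpha.get(n, 0.0) * abs(image1.get(n, 0.0) - image2.get(n, 0.0))
        for n in nodes
    )

# Hypothetical images of two documents of different length.
doc1 = normalize({"Animals": 4.0, "Military": 2.0}, total_words=100)
doc2 = normalize({"Animals": 6.0, "Food": 3.0}, total_words=150)

# Hypothetical importance coefficients, all equal here.
alpha = {"Animals": 1.0, "Military": 1.0, "Food": 1.0}

d = distance(doc1, doc2, alpha)
print(d)
```

After normalization the two documents have the same density for Animals, so only the topics they do not share contribute to the distance. Raising or lowering individual entries of `alpha` is one way to express a user profile in this sketch.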
This dependence on the point of view is handled by selecting the aspect of comparison, i.e., a subtree of the topic hierarchy, so that the document images contain only the selected topics.

5. Conclusions and future work

We have discussed a method of document comparison based on the use of a weighted hierarchy of topics (concepts). The method has the following advantages:

1. Documents that do not share any words literally can still be identified as similar if they share common topics.

2. The comparison can take into account the user profile, or the aspect, i.e., a subset (subtree) of topics that are of interest to the user.

The need for a large dictionary is a disadvantage of the method. However, the method has proved to be insensitive to a rather low quality of the dictionary. For example, in our experiments we used a French dictionary that was an automatic translation of the English one. We applied our algorithm to a set of English documents and the corresponding set of their manual French translations; the difference in the results was insignificant. Also, the documents representing the same text in different languages were reported by the algorithm as very similar.

Though generally the results obtained in our experiments showed good accordance with the opinion of human experts, we have encountered some problems with using our method. Most of these problems are related to lexical ambiguity of different types, such as well (noun versus adverb) or bill (five different meanings as a noun) [7]. In the future, we plan to apply a part-of-speech tagger to resolve the ambiguity of the first type, and to implement an algorithm making use of different senses like bill 1, bill 2 manually marked up in the dictionary; such an algorithm can be thesaurus-based [2] or statistical.

Another direction of improvement of the algorithm is taking into account the anaphoric relationships in the text. For example, the pronouns and zero subjects (in Spanish) could be replaced with the corresponding nouns.

References

[1] Cohen, W., Singer, Y.: Context-sensitive Learning Methods for Text Categorization. In: SIGIR'96 (1996)
[2] Gelbukh, A.: Using a Semantic Network for Lexical and Syntactic Disambiguation. In: Proceedings of Simposium Internacional de Computación: Nuevas Aplicaciones e Innovaciones Tecnológicas en Computación. Mexico (1997)
[3] Guzmán-Arenas, A.: Hallando los Temas Principales en un Artículo en Español. Soluciones Avanzadas 5 (45) (1997) 58, 5 (49) (1997) 66
[4] Guzmán-Arenas, A.: Finding the Main Themes in a Spanish Document. Journal Expert Systems with Applications 14 (1, 2) (1998) 139-148
[5] Feldman, R., Dagan, I.: Knowledge Discovery in Textual Databases (KDT). In: Proc. of Intern. Symposium KDD-95, Montreal, Canada (1995) 112-117
[6] Jacob, E. K.: Cognition and Classification: A Crossdisciplinary Approach to a Philosophy of Classification. (Abstract.) In: Maxian, B. (ed.): ASIS '94: Proceedings of the 57th ASIS Annual Meeting. Medford, NJ: Learned Information (1994) 82
[7] Krovetz, R.: Homonymy and Polysemy in Information Retrieval. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (1997) 72-79
[8] Lewis, D. D., Ringuette, M.: A Comparison of Two Learning Algorithms for Text Categorization. In: Third Annual Symposium on Document Analysis and Information Retrieval (1994) 81-93
[9] Meunier, J. G., Remaki, L., Forest, D.: Use of classifiers in computer-assisted reading and analysis of texts (CARAT). To be published. (1999)
[10] Riloff, E., Shepherd, J.: A Corpus Based Approach for Building Semantic Lexicons. In: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP-2) (1997)
[11] Srikant, R., Agrawal, R.: Mining Sequential Patterns: Generalizations and Performance Improvements. In: Proc. of the 5th International Conference on Extending Database Technology (EDBT), Avignon, France, March (1996)
[12] TextAnalyst system. http://www.analyst.ru (in Russian), http://www.megaputer.com (1998)