Ambiguity Measure Feature Selection Algorithm

Saket S.R. Mengle and Nazli Goharian
Information Retrieval Lab, Illinois Institute of Technology, Chicago, IL

With the increasing number of digital documents, the ability to automatically classify those documents both efficiently and accurately is becoming more critical and difficult. One of the major problems in text classification is the high dimensionality of the feature space. We present the ambiguity measure (AM) feature-selection algorithm, which selects the most unambiguous features from the feature set. Unambiguous features are those features whose presence in a document indicates a strong degree of confidence that the document belongs to only one specific category. We apply AM feature selection on a naïve Bayes text classifier. We favorably show the effectiveness of our approach in outperforming eight existing feature selection methods, using five benchmark datasets, with a statistical significance of at least 95% confidence. The support vector machine (SVM) text classifier is shown to perform consistently better than the naïve Bayes text classifier. The drawback, however, is the time complexity of training a model. We further explore the effect of using the AM feature-selection method on an SVM text classifier. Our results indicate that the training time for the SVM algorithm can be reduced by more than 50%, while still improving the accuracy of the text classifier. We favorably show the effectiveness of our approach by demonstrating that it statistically significantly (99% confidence) outperforms eight existing feature-selection methods using four standard benchmark datasets.

Introduction

There is an overflow of unorganized digital data in today's world. Vast volumes of digital text are available via the World Wide Web (WWW), news feeds, electronic mail, corporate databases, medical patient records and digital libraries. The problem of classifying and storing these documents poses a significant challenge. Large companies filter incoming e-mails and store them in folders or route them to the concerned departments. News agencies also use classification tools for filtering or routing news from different sources to the appropriate client. Other applications of text classification are in the fields of knowledge-base extraction, e-commerce and information extraction. Companies spend significant resources on classifying documents manually. The feasibility of manual classification decreases as the number of documents increases over time. As the number of documents grows large, a fast and scalable automatic classifier is needed to classify the existing and incoming documents accurately and efficiently. We propose, design, develop and evaluate one such classifier.

Text classification involves scanning through text documents and assigning categories to them to reflect their content (Yang, 1999). One of the major characteristics of text classification is the high dimensionality of the feature set (Mladenić & Grobelnik, 1998). The feature set for a dataset consists of the unique terms in the training documents. However, the number of features in a text classification dataset is prohibitively high for many learning algorithms. Hence, it is highly desirable to reduce the feature set without sacrificing categorization accuracy. Feature selection is formally defined in Galavotti & Sebastiani (2000) as the activity of selecting, from the set of r distinct features (i.e., words) occurring in the collection, the subset of r′ ≪ r features that are most useful for compactly representing the meaning of the documents. Feature-selection methods are used to achieve two objectives:
1. To reduce the size of the feature set to optimize classification efficiency.
2. To reduce noise in the feature set to optimize classification effectiveness.

Most existing feature selection algorithms, such as odds ratio (Mladenić & Grobelnik, 1998), information gain (Quinlan, 1986), Chi-Squared (Yang & Pedersen, 1997), bi-normal separation (Forman, 2003) and tficf (Chih & Kulathuramaiyer, 2004), calculate a score based on the probability that a feature belongs to a given category and the probability that a feature does not belong to the other categories. These algorithms perform poorly on unbalanced text classification datasets. The nature of unbalanced datasets is such that a few categories have significantly more training documents than most of the categories; hence, the term frequency of many features appearing in these few categories is significantly higher than their frequency in other categories. Moreover, if such a term has the same term frequency in two or more categories, the feature cannot confidently point to a given category. Thus, such terms should not be considered important in a single-labeled text classification process and should be filtered.

However, algorithms such as odds ratio, information gain, chi-squared, bi-normal separation and tficf assign a higher weight to these terms even if they appear in more than one category. We call these terms ambiguous terms. To tackle this problem, we present a feature selection method called Ambiguity Measure (AM; Mengle, Goharian & Platt, 2007) that assigns a high score to a term if it appears consistently in only one specific category. The intuition is that a term that appears in only one category points more strongly to that specific category and thus is a better indicator in a single-labeled classification decision. We apply AM on a single-labeled naïve Bayes text classifier and compare AM with eight feature selection algorithms on five standard datasets from various subject domains, namely news feeds, web pages, and bio-medical text. Our results indicate that AM feature selection achieves statistically significant improvements on unbalanced datasets such as OHSUMED (20%) and Genomics (7.5%), and on balanced datasets such as WebKB (2.6%), 20NG (2.14%) and Reuters (0.25%), when compared to the best performing of the eight feature selection methods. However, the improvements on the unbalanced datasets are larger than the improvements on the balanced datasets.

Furthermore, we also explore the effects of the AM feature selection method when applied on the single-labeled Support Vector Machine (SVM) algorithm (Cortes & Vapnik, 1995; Joachims, 1999; Yang, Zhang & Kisiel, 2003). The SVM algorithm is one of the widely used text classification algorithms. Prior work (Joachims, 1998) indicates that SVM performs consistently better than naïve Bayes, kNN, C4.5 and Rocchio text classifiers. However, one of the limitations of SVM is its training time complexity. Yang, Zhang & Kisiel (2003) show that SVM has a higher time complexity for training a model than other text classification algorithms. To overcome this limitation of SVM, feature selection methods are used as a preprocessing step before training SVM (Wenqian et al., 2007; Novovicova & Malik, 2005; Yan et al., 2005). Many well-known feature selection algorithms are used with SVM to improve its accuracy and efficiency. We use the AM feature selection method as a pre-processing step for the Support Vector Machine classifier (Mengle & Goharian, 2008). The features whose AM scores are below a given threshold, i.e., the more ambiguous terms, are purged, while the features whose AM scores are above the threshold are used for the SVM learning phase. We favorably compare the results of the AM feature selection algorithm with the same eight feature selection algorithms reported in (Wenqian et al., 2007; Yan et al., 2005) on four of the standard benchmark datasets. We also empirically show that using AM feature selection with SVM reduces the training time by more than 50%, while maintaining the accuracy of the classifier.

Prior Work

Various techniques are used for finding an "optimal" subset of features from a larger set of possible features. Exhaustively trying all the subsets is not computationally feasible. Hence, automatic feature selection algorithms are used to find the most important features in the feature set. In this section, we present the commonly used feature selection algorithms.

Odds Ratio

The basic idea of odds ratio (Mladenić & Grobelnik, 1998) is to calculate the odds of a term occurring in the positive class (the category a term is related to) normalized by the odds of that term occurring in the negative class (the categories a term is not related to). The odds ratio of a term t for a category c_i is defined using Equation 1:

OddsRatio(t, c_i) = [P(t | c_i) (1 − P(t | ¬c_i))] / [(1 − P(t | c_i)) P(t | ¬c_i)]    (1)

Odds ratio is known to work well with the naïve Bayes text classifier algorithm (Mladenić et al., 2004; Mladenić & Grobelnik, 1998).
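To make the computation concrete, the following minimal Python sketch (our own illustration, not code from the paper) scores one term against every category from per-category document counts; the argument names and the smoothing constant are assumptions.

```python
def odds_ratio_scores(docs_with_term, docs_in_cat, smoothing=0.5):
    """Odds ratio of a single term for every category (Equation 1).

    docs_with_term[c] -- number of category-c training documents containing the term
    docs_in_cat[c]    -- total number of training documents in category c
    A small smoothing constant keeps the estimated probabilities away from 0 and 1.
    """
    total_docs = sum(docs_in_cat.values())
    total_with_term = sum(docs_with_term.values())
    scores = {}
    for c, n_c in docs_in_cat.items():
        # P(t | c): term presence estimated from the positive class.
        p_pos = (docs_with_term.get(c, 0) + smoothing) / (n_c + 2 * smoothing)
        # P(t | not c): term presence estimated from all remaining (negative) documents.
        n_neg = total_docs - n_c
        with_term_neg = total_with_term - docs_with_term.get(c, 0)
        p_neg = (with_term_neg + smoothing) / (n_neg + 2 * smoothing)
        scores[c] = (p_pos * (1.0 - p_neg)) / ((1.0 - p_pos) * p_neg)
    return scores
```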
Information Gain

Information gain (Quinlan, 1986) is commonly used as a surrogate for approximating a conditional distribution in text classification. In information gain, class membership and the presence/absence of a particular term in a given category are seen as random variables; one computes how much information about the class membership is gained by knowing the presence/absence statistics. If the class membership is interpreted as a random variable C with two values, positive (c) and negative (¬c), and a word is likewise seen as a random variable T with two values, present (t) and absent (¬t), then information gain is defined as Equation 2:

IG(t, c) = Σ_{c′ ∈ {c, ¬c}} Σ_{t′ ∈ {t, ¬t}} P(t′, c′) log₂ [ P(t′, c′) / (P(t′) P(c′)) ]    (2)
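As an illustration (our own sketch, not the authors' code), the information gain of one term for one category can be computed directly from a 2×2 contingency table of document counts:

```python
import math

def information_gain(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    """Information gain of a term for a category (Equation 2) from document counts:
    n_tc        -- documents in the category that contain the term
    n_t_notc    -- documents outside the category that contain the term
    n_nott_c    -- documents in the category that lack the term
    n_nott_notc -- documents outside the category that lack the term
    (hypothetical argument names)"""
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    cells = [
        (n_tc,        n_tc + n_t_notc,        n_tc + n_nott_c),         # (t, c)
        (n_t_notc,    n_tc + n_t_notc,        n_t_notc + n_nott_notc),  # (t, not c)
        (n_nott_c,    n_nott_c + n_nott_notc, n_tc + n_nott_c),         # (not t, c)
        (n_nott_notc, n_nott_c + n_nott_notc, n_t_notc + n_nott_notc),  # (not t, not c)
    ]
    ig = 0.0
    for joint, marg_t, marg_c in cells:
        if joint > 0:
            # P(t', c') * log2( P(t', c') / (P(t') P(c')) ), one cell of the double sum
            ig += (joint / n) * math.log2((joint * n) / (marg_t * marg_c))
    return ig
```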

Chi-Squared

The χ² test is used in statistics to test the independence of two events. In text classification, χ² (Yang & Pedersen, 1997; Galavotti & Sebastiani, 2000; Wu & Flach, 2001) is used to measure the association between a category and features. The χ² measure of a term t for a category c_i is defined using Equation 3:

χ²(t, c_i) = [P(t, c_i) P(¬t, ¬c_i) − P(t, ¬c_i) P(¬t, c_i)]² / [P(t) P(¬t) P(c_i) P(¬c_i)]    (3)

Thus, the χ²(t, c_i) score indicates the weight of term t with respect to category c_i. If a term is close to more categories, then the score of that term is higher. The score of each term t is calculated using Equation 4:

χ²(t) = Σ_{i=1}^{|C|} P(c_i) χ²(t, c_i)    (4)

Bi-Normal Separation

In the bi-normal separation (BNS) feature selection method (Forman, 2003; Forman, 2008), the occurrence of a given term in each document is modeled by a random normal variable that exceeds a hypothetical threshold. The prevalence rate is calculated with respect to both the positive and negative classes. The prevalence rate can be defined as the area under the curve past a certain threshold. Thus, if a term consistently appears in the positive class, the threshold is farther from the tail of the curve than that of the negative class. BNS is calculated based on the separation between these two thresholds. Thus, if a term appears more consistently in the positive class than in the negative class, it is assigned a higher BNS score. BNS is calculated using Equation 5:

BNS = | F⁻¹(tp / (tp + fn)) − F⁻¹(fp / (fp + tn)) |    (5)

where tp is the number of positive cases containing the word, fp is the number of negative cases containing the word, fn is the number of positive cases that do not contain the word, tn is the number of negative cases that do not contain the word, and F⁻¹ is the inverse cumulative probability function of the standard normal distribution. As reported in (Forman, 2003), BNS+F1 yields the best performance on most of the tasks in comparison with odds ratio, information gain and Chi-Squared. The F1 metric (Equation 6) is the harmonic mean of precision (Equation 7) and recall (Equation 8).

F1 = 2 · Precision · Recall / (Precision + Recall)    (6)

Precision = tp / (tp + fp)    (7)

Recall = tp / (tp + fn)    (8)
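A small Python sketch of the BNS score defined in Equation 5 (our own illustration; clipping the extreme rates is a common practical choice rather than something stated above):

```python
from statistics import NormalDist

def bns_score(tp, fp, fn, tn, eps=0.0005):
    """Bi-normal separation of one term (Equation 5).
    tp / fn -- positive-class documents with / without the term
    fp / tn -- negative-class documents with / without the term
    eps clips the rates away from 0 and 1, where the inverse normal CDF is undefined."""
    inv_cdf = NormalDist().inv_cdf                    # F^-1 of the standard normal
    tpr = min(max(tp / (tp + fn), eps), 1.0 - eps)    # prevalence in the positive class
    fpr = min(max(fp / (fp + tn), eps), 1.0 - eps)    # prevalence in the negative class
    return abs(inv_cdf(tpr) - inv_cdf(fpr))
```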
Improved Gini Index

In the Gini index (Breiman, Friedman and Olshen, 1984), if a term appears in every document of class c_i, then it receives a high Gini index score, regardless of the term's occurrence in other classes. When a term is distributed evenly over the documents of various categories, the term is assigned a lower Gini index score. The Gini index of a term t can be calculated using Equation 9:

Gini(t) = P(t) [1 − Σ_i P(c_i | t)²] + P(¬t) [1 − Σ_i P(c_i | ¬t)²]    (9)

However, the Gini index fails to consider the frequency of documents where the term occurs within larger categories. The categories are generally unbalanced with respect to the number of training documents. Hence, the Gini index score is biased with respect to categories that have a large number of training documents. Wenqian et al. (2007) constructed a new function called the improved Gini index that considers a term's conditional probability and combines the posterior probability and conditional probability to avoid the effects of unbalanced classes in datasets. The improved Gini index of a given term t is defined using Equation 10:

IGini(t) = Σ_{i=1}^{|C|} P(t | c_i)² P(c_i | t)²    (10)

Tficf

In tficf (Chih & Kulathuramaiyer, 2004), tf refers to the term frequency of a term in a given category and icf refers to the inverse category frequency, i.e., the ratio of the total number of categories in a dataset to the number of categories a term falls into. The tficf scheme does not discriminate between terms that occur frequently in a small subset of documents in a category and terms that are present in a large number of documents throughout a category. Thus, tficf considers that the less a term occurs across categories, the higher its score. The tficf of a term t in category c_i is defined using Equation 11:

tficf(t, c_i) = tf(t, c_i) · log(|C| / cf(t))    (11)

where |C| refers to the total number of categories in a dataset, tf(t, c_i) is the term frequency of term t in category c_i, and cf(t) refers to the number of categories in which term t appears.

Tfidf

In tfidf (Chih & Kulathuramaiyer, 2004), tf refers to the term frequency of a term in a given document. idf is defined as the inverse document frequency, i.e., the ratio of the total number of documents present in a dataset to the number of documents a given term appears in.

A higher idf of a term indicates that the term appears in relatively few documents and may be more important during the process of text classification. tfidf is a commonly used term-weighting technique in the field of information retrieval (Grossman & Frieder, 2004) and is also used in text classification (Lavelli, Sebastiani & Zanoli, 2004; Debole & Sebastiani, 2003). The tfidf of a term t in document d_j is defined using Equation 12:

tfidf(t, d_j) = tf(t, d_j) · log(|D| / df(t))    (12)

where |D| refers to the total number of documents in the dataset, tf(t, d_j) is the term frequency of term t in document d_j, and df(t) refers to the number of documents in which term t appears.

Orthogonal Centroid Feature Selection (OCFS)

Orthogonal centroid feature selection (OCFS) (Yan et al., 2005) selects features optimally according to the objective function implied by the orthogonal centroid algorithm. The centroid of each class (m_j) and of the entire dataset (m) are calculated from the training data. The score of a term t_i is calculated using Equation 13:

OCFS(t_i) = Σ_{j=1}^{|C|} (n_j / n) (m_j^i − m^i)²    (13)

where m_j^i and m^i denote the i-th components of the class-j centroid and of the overall centroid, respectively, n_j is the number of training samples that belong to category j, and n is the total number of training samples. The feature set is pruned by selecting only the features whose scores are higher than a threshold. OCFS is not greedy in nature like odds ratio or information gain. Hence, the OCFS algorithm can be optimized based on the objective function that is implied by the orthogonal centroid algorithm and has been shown to improve over traditional algorithms.
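A compact NumPy rendering of Equation 13 (our own sketch; the dense term-document matrix input is an assumption):

```python
import numpy as np

def ocfs_scores(X, y):
    """OCFS feature scores (Equation 13).
    X -- (n_documents, n_features) dense array of term weights
    y -- category label for each document
    Returns one score per feature; features above a chosen threshold are kept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    m = X.mean(axis=0)                      # centroid of the entire training set
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        members = X[y == label]
        m_j = members.mean(axis=0)          # centroid of class j
        scores += (members.shape[0] / n) * (m_j - m) ** 2
    return scores
```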
Methodology

In this section, we first describe the motivation behind our AM feature selection algorithm and formally define AM. Second, we discuss the differences between the AM measure and various feature selection algorithms. Finally, we define a methodology for using the feature selection algorithms with text classification algorithms such as NB and SVM.

AM Feature Selection Algorithm

Initially, we describe the intuitive motivation behind our AM feature selection approach and then provide a formal definition. First, we consider the human perception of the topic of a document by glancing at the document and capturing its keywords. Instead of using all the terms in a document to determine its subject, one normally bases the decision on the most unambiguous words that the eye captures. The person then has an idea of the topic of the document. Some words can easily suggest the category a document falls into. For example, if the document has phrases like "Chicago White Sox" and "MLB World Series Champion", then one can suggest that the document relates to baseball in particular and sports in general. The sample text below is taken from Wikipedia¹. By glancing at this text, the reader can guess the category.

"Metallica is a Grammy Award-winning American heavy metal/thrash metal band formed in 1981 and has become one of the most commercially successful musical acts of recent decades. They are considered one of the "Big Four" pioneers of thrash metal, along with Anthrax, Slayer, and Megadeth. Metallica has sold more than 90 million records worldwide, including 57 million albums in the United States alone."

The text seems to be about Music. Our human perception is based on our knowledge of the domain or on what we hear or read on various subjects in daily life. Thus, without reading this specific text completely, one can confidently claim that the text belongs to Music rather than Terrorism or Politics. Some terms may be stronger indicators than others that a given text belongs to a certain category. Thus, we can assign a score to how strongly a term suggests a particular category. We clarify this with the following hypothetical example.

"Carolina Panthers lost the Superbowl title to Chicago Bears due to a last minute touchdown."

In the above sentence, we have the terms Bears and Panthers, which are related to wildlife. On the other hand, they are also the names of famous NFL football teams. Here we notice uncertainty in classifying the text into the Wildlife or the Sports category. Terms such as Superbowl and touchdown in the same text suggest with more certainty that the text is about Sports.

¹ Wikipedia.

Table 1. Ambiguity Measure (AM) example (per-category counts and AM scores for the terms Metallica, Anthrax and Records across the categories Medicine, Music, Sports and Politics).

We define an ambiguity measure, AM, for each term t with respect to category c_i using Equation 14. The maximum AM score of term t over all categories is assigned as the AM score of term t (Equation 15).

AM(t, c_i) = tf(t, c_i) / tf(t)    (14)

AM(t) = max_i AM(t, c_i)    (15)

where tf(t, c_i) is the term frequency of term t in category c_i and tf(t) is the term frequency of term t in the entire collection. We thus assign a higher score to unambiguous terms. In the above example, the term touchdown has a higher AM than the terms Bears and Panthers. The AM score is close to 1 if the term is unambiguous. Conversely, if AM is closer to 0, the term is considered more ambiguous and may point to more than one category. The AM score of the feature Metallica, for the sample text, is 0.99, which indicates that Metallica is an unambiguous feature and should be kept and not filtered (Table 1). Anthrax is related to the Medicine category; Anthrax is also the name of a famous music band of the 1980s and hence also appears in the category Music. Thus, the AM of Anthrax is lower than that of Metallica. In some cases the AM score of a feature is low because it appears consistently in multiple categories. An example is the term Records, which may appear in all three (sports, music, and medicine) categories. Thus, the AM score of such a term is low (0.33), and it is desirable to filter out such features. This reduction in the dimensionality of the feature set increases the accuracy by avoiding the terms that have lower AM scores. We empirically determine a threshold and filter out the features whose AM scores are below that threshold.
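Equations 14 and 15 translate directly into code. The sketch below (our own illustration, not the authors' implementation) computes AM scores from per-category term frequencies and drops the terms that fall below a threshold; the container layout and the example threshold are assumptions.

```python
def ambiguity_measures(term_freq_by_cat):
    """term_freq_by_cat[term][category] -> tf(term, category).
    Returns {term: AM(term)} with AM(t) = max over categories of tf(t, c) / tf(t)."""
    am = {}
    for term, per_cat in term_freq_by_cat.items():
        total = sum(per_cat.values())         # tf(t) over the whole training collection
        am[term] = max(per_cat.values()) / total
    return am

def select_unambiguous(term_freq_by_cat, threshold=0.4):
    """Keep only the terms whose AM score reaches the (empirically tuned) threshold."""
    return {t: score for t, score in ambiguity_measures(term_freq_by_cat).items()
            if score >= threshold}

# Toy usage: 'records' is spread over three categories, so it is filtered out.
counts = {
    "touchdown": {"sports": 98, "music": 2},
    "records":   {"sports": 30, "music": 35, "medicine": 35},
}
print(select_unambiguous(counts))   # {'touchdown': 0.98}
```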
Differences

The feature selection methods odds ratio, information gain, BNS+F1 and Chi-Squared assign a high score to a term even if it appears in more than one category. Using such features does not assist a single-labeled text classifier in distinguishing between categories. The AM feature selection method assigns a high score to a term if it appears consistently in only one specific category. Such terms can then point the classifier to that specific category. For example, consider a term t1 with half of its occurrences in one category c1 and the other half distributed uniformly across the other categories. Term t1 confidently points to category c1 and hence is assigned an AM score of 0.5. Consider another term t2 with 49% of its occurrences in category c1 and the other 51% of its occurrences concentrated in two other categories c2 and c3. An AM score of 0.49 is assigned to term t2. As our goal is single-labeled classification, AM assigns a higher score to term t1 than to term t2, as t1 points more confidently to category c1. However, algorithms such as information gain, odds ratio, BNS+F1 and Chi-Squared assign a score to a term that is inversely proportional to the number of categories that term appears in. Hence, term t2 (which occurs in three categories) is assigned a higher score than t1 (which occurs in all categories). However, term t2 may mislead a single-labeled classifier, as it also points to categories c2 and c3, each with a lower probability (25.5%). Term t1 confidently points to only category c1 and hence should be assigned a higher score than t2.

In the improved Gini index method, the probabilities of a term with respect to all the categories are considered. If a term t appears in many documents of category c_i, then it is assigned a high score. In a situation where the term frequency of term t is the same in categories c_i and c_j, and t also appears in every document of both categories c_i and c_j, t is assigned a high score. However, as term t belongs to two different categories, it is ambiguous. Our proposed AM feature selection method avoids such situations and assigns a low score to features like t.

Using the tfidf and tficf methods, terms that appear with a low frequency in only a single category are purged during the feature selection process. However, such terms are unambiguous and point to a single category. Another problem is that some terms have a similar distribution in more than one category (low idf or icf) but have a high term frequency. These terms are selected during the process of feature selection because their term frequency is high. Such terms are ambiguous, as they do not point strongly to only a single category. The AM feature selection method avoids both situations by considering only the ratio between the number of occurrences of a term in a given category and the total number of occurrences of that term in the training set.

Figure 1. Block diagram for using a feature selection method on a text classifier.

In OCFS, the training and the testing time are quadratic, as the centroids of each class and of the entire dataset are calculated. The AM feature selection method, in contrast, trains and tests in linear time (this is discussed later in the paper).

Using feature selection algorithms on SVM and NB text classifiers

We evaluate our feature selection algorithm on SVM and naïve Bayes text classifiers. SVM is commonly used, as it was shown to perform better in terms of effectiveness than other text classifiers such as naïve Bayes, kNN, C4.5 and Rocchio (Joachims, 1998). The naïve Bayes algorithm is, however, more efficient and scalable than other algorithms (Yang, Zhang & Kisiel, 2003). We present the methodology for applying feature selection algorithms on SVM and NB text classifiers (Figure 1). This process is divided into four phases.

Phase 1. Calculating feature selection scores. In the pre-processing step, the feature selection score of each feature in the training documents is calculated.

Phase 2. Filtering terms with lower feature scores. We only keep a feature in the training documents if its feature selection score is above a certain empirically determined threshold. We determine these thresholds by exhaustively optimizing the results of each algorithm on the testing documents. The choice of the testing set rather than a separate validation set is made to be consistent with the prior works (Wenqian et al., 2007; Yan et al., 2005; Chih & Kulathuramaiyer, 2004) that we compare our work with. We compare AM with both local and global feature selection algorithms. We globalize the local feature selection algorithms by selecting the terms with the highest local scores. Additionally, we also experiment with the round robin method (Forman, 2004) to convert local feature selection scores into global scores.

Phase 3. Training the text classifier. Pruned documents from Phase 2 are used by the NB and SVM classifiers to train a text classification model. For NB, we use the traditional NB classifier as explained in (McCallum and Nigam, 1998) to create a text classification model. We use the linear SVM kernel, as the non-linear versions gain very little in terms of performance (Mladenić et al., 2004). For training and testing the SVM model, we use the LibSVM software², which is commonly used for classifying documents into binary or multi-labeled categories.

Phase 4. Classifying documents. In the testing phase, the trained text classification model is used to classify the testing documents by predicting a category for each. Unlike the traditional naïve Bayes text classifier, we, as in (Rennie, Teevan & Karger, 2003), do not consider the prior probability while predicting the category for a testing document. As SVM only classifies documents into two classes (a binary classifier), we use the one-against-all technique (Yi & Zheng, 2005) to run SVM on multiclass datasets. We use single-labeled classification in this work to classify documents. Hence, only one category is predicted for each testing document by the text classifier.

² Chang C.C., Lin C.J., LIBSVM: a library for support vector machines, 2001.
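The four phases can be sketched end to end as follows (our own illustration using scikit-learn; the paper itself uses its own naïve Bayes implementation and LibSVM, and the threshold of 0.3 is only an example). The score function stands for any of the measures above, e.g., AM, returning one score per vocabulary term.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def train_with_feature_selection(train_texts, train_labels, score_fn, threshold=0.3):
    """score_fn(X, y) -> NumPy array with one feature-selection score per vocabulary term."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_texts)                      # Phase 1 input: term-count matrix
    keep = np.flatnonzero(score_fn(X, train_labels) >= threshold)  # Phase 2: retained feature indices
    clf = LinearSVC().fit(X[:, keep], train_labels)                # Phase 3: linear SVM, one-vs-rest
    return vectorizer, keep, clf

def classify(texts, vectorizer, keep, clf):
    """Phase 4: classify new documents with the same pruned vocabulary."""
    return clf.predict(vectorizer.transform(texts)[:, keep])
```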

Time and Space Complexity Analysis

AM scores are computed in linear time as training documents arrive. However, the scalability of using AM depends on the text classifier. The comparison of the time and space complexity of applying AM on naïve Bayes and SVM is given in Table 2, and is discussed in the subsections that follow.

Table 2. Time and space complexity for applying AM on naïve Bayes and SVM
Classifier | Training time | Testing time per document | Space complexity
Naïve Bayes using AM | O(N·L_d + M·V) | O(M·L_v) | O(M·V)
SVM using AM | O(N·L_d + M·V + M·N^c), c ≈ 1.2–1.5 | O(M·L_v) | O(M·V + N·L_v + q²)
(N: number of training documents; L_d: average document length; M: number of categories; L_v: average number of unique terms per document; V: size of the vocabulary (features); q: a constant that depends on the iterations needed.)

Analysis of the time complexity of applying AM on naïve Bayes

The term frequency of each term per category is calculated. Thus, naïve Bayes parses N·L_d terms during the training phase. For every term in the vocabulary, M different AM scores are calculated, which takes O(M·V) time. Thus, the training time for naïve Bayes using AM is O(N·L_d + M·V), which equates to O(N·L_d) (as M·V ≪ N·L_d). During the testing phase, we calculate the product of the AM scores of the terms present in the testing document with respect to each category, which takes O(M·L_v). A lexicon of all the terms in the vocabulary (V) and their AM scores with respect to all M categories is stored as the NB model. Many of the features are filtered during the feature selection process, so only some of the features and their AM scores are stored. The space needed by naïve Bayes using AM is O(M·V).
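The prediction rule just described, the product of the AM scores of a document's retained terms taken per category with no prior, can be sketched as follows (our own illustration; the log-space accumulation, the smoothing floor for unseen term-category pairs, and returning None for documents with no retained terms are assumptions).

```python
import math

def predict_category(doc_terms, am_by_term, categories, floor=1e-6):
    """am_by_term[t][c] -> AM(t, c) for the terms kept after feature selection.
    Returns the category whose product of AM scores is largest, or None ("uncertain")
    when the document contains no retained terms."""
    kept = [t for t in doc_terms if t in am_by_term]
    if not kept:
        return None
    scores = {}
    for c in categories:
        # product of AM(t, c) over the retained terms, accumulated in log space
        scores[c] = sum(math.log(max(am_by_term[t].get(c, 0.0), floor)) for t in kept)
    return max(scores, key=scores.get)
```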
Analysis of the time complexity of applying AM on SVM

As shown, the training time for naïve Bayes using AM is O(N·L_d + M·V). Thus, AM for all the features in the training set can be found in linear time. SVM, however, trains in quadratic time. The algorithms used in LibSVM train in O(M·N^c), where c ≈ 1.2–1.5 (Yang, Zhang & Kisiel, 2003). Hence, the total time taken for training a model using AM as a preprocessing step for SVM is O(N·L_d + M·V + M·N^c). However, as N·L_d and M·V are much smaller than M·N^c, we consider the training time for using AM with SVM to be O(M·N^c). The space taken for storing the SVM model is O(N·L_v + q²), where q is a constant that depends on the iterations needed. Hence, the space complexity of using AM with SVM is O(M·V + N·L_v + q²).

Experimental Setup

We empirically evaluated the effectiveness of the AM feature selection algorithm using five benchmark data sets (Reuters 21578, 20 Newsgroups, WebKB, OHSUMED, Genomics), which are commonly used in text classification evaluation. The details of these data sets are given in Table 3. We intentionally chose these datasets, which consist of news articles, web pages and bio-medical documents, to show the effects of AM on different domains. Although we observe different accuracies across different domains, AM consistently outperforms the other feature selection algorithms over all domains. To show the scalability of our AM feature selection approach, using the NB classifier, we also report an effectiveness and efficiency analysis on the TREC 2005 Genomics dataset, which contains 4.5 million documents.

We do not show the results for TREC Genomics 05 on the SVM classifier, as SVM is not scalable for use on very large datasets. (The training time of an SVM model for TREC 05 Genomics is almost 4 days.) In all our experiments, we use a single computer with an AMD Athlon 2.16 GHz processor and 1 GB of RAM. A brief explanation of the benchmark datasets used in our experiments is given below.

Table 3. Benchmark datasets used in our experiments
Dataset | No. of documents | No. of categories | Size of dataset | Domain
Reuters 21578 | 21,578 | Top 10 categories | 28 MB | News articles
20 Newsgroups | 20,000 | 20 categories | 61 MB | News articles
WebKB | 8,282 | 7 categories | 43 MB | Web pages (university websites)
OHSUMED | 54,710 (total), 39,320 (subset) | Top 50 categories | 382 MB | Bio-medical documents
GENOMICS (TREC 05) | 4.5 million (total), 591,689 (subset) | Top 50 categories | 15.5 GB | Bio-medical documents

Reuters Dataset. The Reuters corpus³ contains Reuters news articles from 1987. These documents may be multi-labeled, single-labeled, or not labeled. The average document length in the Reuters dataset is 200 (non-unique) terms per document. The Reuters dataset consists of a total of 135 categories (labels), ten of which have significantly more documents than the rest of the categories. Thus, commonly the top 10 categories are used to evaluate the accuracy of the classification results. The top 10 categories of Reuters are earn, acq, money-fx, grain, trade, crude, interest, wheat, corn and ship.

20 Newsgroups (20NG) Dataset. 20 Newsgroups⁴ (20NG) consists of a total of 20,000 documents that are categorized into twenty different categories. Each category contains 1,000 documents. The average document length in the 20NG dataset is 311 terms per document. Thus, the average size of the documents is much larger than in the Reuters dataset. Some of the newsgroup categories are very closely related to each other (e.g., comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (e.g., misc.forsale and soc.religion.christian). This characteristic contributes to the difficulty of categorizing documents that belong to very similar categories.

WebKB Dataset. The WebKB dataset⁵ is a collection of web pages from four different college websites, namely Cornell, Texas, Washington and Wisconsin, plus some miscellaneous web pages. These web pages are pre-classified into seven categories: student, faculty, staff, department, course, project and other. WebKB contains 8,282 web pages. The average document length in the WebKB dataset is 130 terms.

OHSUMED Dataset. OHSUMED (Hersh, Buckley, Leone & Hickam, 1994) is a collection of Medline documents, i.e., medical citations, from 1987 to 1991, and is commonly used for bio-medical literature search evaluation and classification. We use only the top (largest) 50 categories. The average document length in the OHSUMED dataset is 63 terms per document. The distribution of documents in the OHSUMED dataset is uneven. The largest category contains 2,415 documents, while the smallest category contains 873 documents. Hence, more training data are available for some categories than for others.

TREC 2005 Genomics Dataset. TREC 05 GENOMICS is a collection of 4.5 million bio-medical documents and is 15.5 GB in size. This is the largest publicly available benchmark dataset that contains categorized (labeled) documents in the domain of bioinformatics. The average document length is 183 terms per document. We are not aware of any prior text classification efforts on the TREC 05 GENOMICS⁶ data set. Thus, for this dataset no comparison with prior efforts was possible. We used the data processed by (Urbain, Goharian & Frieder, 2007). They use a pre-processing model that breaks up gene names and is shown to perform well. Acronyms and their long forms are identified during preprocessing using the Schwartz and Hearst algorithm (Schwartz & Hearst, 2003).

³ Lewis D., Reuters-21578, resources/testcollections/reuters21578.
⁴ Lang K., Original 20 Newsgroups dataset, people.csail.mit.edu/jrennie/20newsgroups.
⁵ WebKB dataset, cs.cmu.edu/project/theo-20/www/data/.
An example of such a long-short form is "immuno deficiency enzyme (IDE)", and a short-long form is "IDE (immuno deficiency enzyme)". The algorithm works backwards through the long-form text and attempts to identify the corresponding letters in the acronym. All terms are tokenized, stop words are removed, and lexical variants are generated. Porter stemming (Porter, 1997) is used on each token with the following exceptions: gene names (as defined by the Entrez Gene database); all upper case, mixed case, and alpha-numeric terms; and non-gene terms that would become a gene name after being stemmed. Similar to the OHSUMED dataset, the top (largest) 50 categories, i.e., those containing the highest numbers of documents, are chosen for the GENOMICS dataset; the categories are ranked based on the number of documents. This subset of the Genomics dataset contains 591,589 documents. The category with the highest number of documents contains 295,773 documents, while the category among the top 50 with the fewest documents has 8,049 documents. Hence, if we chose categories beyond the top 50, the number of training documents in those categories would be very low, leading to lower classification accuracy.

⁶ TREC 2005 Genomics dataset.

Evaluation Metrics

To evaluate the effectiveness of our approach and compare it to state-of-the-art feature selection research results, we use the commonly used evaluation metrics precision, recall and F1 measure.

Precision (P) = True Positive / (True Positive + False Positive)    (16)

Precision (Equation 16) is defined as the ratio of correct classifications of documents into categories to the total number of attempted classifications.

Recall (R) = True Positive / (True Positive + False Negative)    (17)

Recall (Equation 17) is defined as the ratio of correct classifications of documents into categories to the total number of labeled data in the testing set.

F1 Measure = 2 · Precision · Recall / (Precision + Recall)    (18)

The F1 measure (Equation 18) is defined as the harmonic mean of precision and recall. Hence, a good classifier is assumed to have a high F1 measure, which indicates that the classifier performs well with respect to both precision and recall. We present micro-averaged results for precision, recall and F1 measure. Micro-averaging considers the sum of all the true positives, false positives and false negatives that are generated in the ten runs of 10-fold cross validation (Lewis, 1991).

Results

We organize the results into two subsections. In the first subsection, we present the results for the naïve Bayes classifier using the AM feature selection method. In the second subsection, the results for AM feature selection with the SVM classifier are presented.

Naive Bayes using AM

We evaluated the experimental results using the Reuters 21578, 20NG, WebKB, OHSUMED and TREC 05 Genomics datasets. We present the comparison of the AM feature selection algorithm with the eight feature selection algorithms explained earlier in the prior work section. We varied the threshold to identify the optimal F1 measure for each feature selection method. The results show that AM statistically significantly outperforms the others with a confidence level of at least 95%. We demonstrate the effects of using the round robin method, which is used for globalizing localized feature selection scores. We also present the effects of AM on the training and testing time of the naïve Bayes classifier.

Comparison with other feature selection algorithms using the naïve Bayes classifier

We used stratified 10-fold cross validation for all the datasets except WebKB. We used a standard 4-1 split for WebKB, where the data for three universities are used for training and the data for one university are used as a testing set. We varied thresholds to observe the best results with respect to F1.
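For concreteness, the micro-averaged precision, recall and F1 reported below can be computed from the summed confusion counts of the cross-validation runs (a generic sketch, not the authors' evaluation script):

```python
def micro_averaged_metrics(fold_counts):
    """fold_counts: iterable of (true_positives, false_positives, false_negatives),
    one tuple per run; the counts are summed before the ratios are taken."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0                                      # Equation 16
    recall = tp / (tp + fn) if tp + fn else 0.0                                         # Equation 17
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0   # Equation 18
    return precision, recall, f1
```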
Our results show that AM performs better than the next best performing feature selection algorithm by 20%, 7.5%, 0.25%, 2.14%, and 2.6% on the OHSUMED, TREC 05 Genomics, Reuters 21578, 20 Newsgroups and WebKB datasets, respectively. Figure 2 shows the comparison of the eight feature selection algorithms on the Reuters dataset with respect to F1 measure. Our experimental results show that AM (precision: 92.36%, recall: 85.72%, F1: 88.92%) performs better than tfidf (precision: 90.78%, recall: 86.69%, F1: 88.69%) and BNS+F1 (precision: 88.13%, recall: 88.01%, F1: 88.07%), which are the next best performing algorithms. As all the feature selection algorithms perform well on the Reuters dataset, the F1 improvement when using the AM measure is only 0.25% (95% confidence). The statistical significance of AM with respect to the other feature selection algorithms on the various datasets is reported in Table 4.

Table 4. Statistical comparison of AM and the other feature selection algorithms on naïve Bayes with respect to F1 measure (paired t-test), over the Reuters, 20 Newsgroups, WebKB, OHSUMED and Genomics datasets, for odds ratio, BNS+F1, tfidf, tficf, information gain, Chi-Squared, OCFS and improved Gini (+: AM is statistically significantly better than the feature selection algorithm with 95% confidence; ++: AM is statistically significantly better than the feature selection algorithm with 99% confidence).

Figure 2. Comparison of AM with other feature selection methods in terms of F1 measure on the Reuters dataset for naïve Bayes.
Figure 3. Comparison of AM with other feature selection methods in terms of F1 measure on the 20 Newsgroups dataset for naïve Bayes.
Figure 4. Comparison of AM with other feature selection methods in terms of F1 measure on the WebKB dataset for naïve Bayes.
Figure 5. Comparison of AM with other feature selection methods in terms of F1 measure on the OHSUMED dataset for naïve Bayes.
Figure 6. Comparison of AM with other feature selection methods in terms of F1 measure on the TREC 05 Genomics dataset for naïve Bayes.
Figure 7. Effect of feature selection on the training and testing time of naïve Bayes using AM on the TREC 05 Genomics dataset.

For 20 Newsgroups (Figure 3), AM (precision: 91.68%, recall: 91.69%, F1: 91.72%) performs significantly better than the next best feature selection algorithm, improved Gini index (precision: 91.69%, recall: 87.97%, F1: 89.79%), by 2.14%. Although the improvement is marginal, the results are statistically significant with at least 95% confidence. The results on the WebKB dataset, which are given in Figure 4, show that AM (precision: 74.34%, recall: 73.76%, F1: 74.05%) performs better than the second best performing algorithm, improved Gini index (precision: 71.74%, recall: 72.56%, F1: 72.15%), by 2.6%. The WebKB dataset consists of web pages, which contain images, tables and other anchor text.

Classifying such documents is more difficult than classifying plain documents from the Reuters and 20 Newsgroups datasets. Hence, the classification effectiveness for the WebKB dataset is lower than for the Reuters and 20 Newsgroups datasets. On the bio-medical datasets, our results indicate that AM (precision: 65.93%, recall: 54.84%, F1: 59.88%) statistically significantly improves (20%) over improved Gini index (precision: 53.83%, recall: 46.54%, F1: 49.92%) on the OHSUMED dataset (Figure 5). AM (precision: 61.71%, recall: 60.54%, F1: 61.12%) also shows a statistically significant improvement of 7.5% over improved Gini index (precision: 61.71%, recall: 52.64%, F1: 56.82%) for the TREC Genomics 05 dataset (Figure 6). Improved Gini index is the second best performing algorithm on both of these datasets.

Discussion

The motivation for using AM feature selection is to select terms that belong to only one category. As mentioned in the introduction, ambiguous features lead to wrong classification predictions on unbalanced datasets. Our results indicate that AM performs better than odds ratio, information gain, tficf, tfidf, BNS+F1 and Chi-Squared on the OHSUMED and Genomics datasets by more than 30% (comparative gain). The OHSUMED and Genomics datasets are unbalanced, and a large number of training documents belong to the top two categories. Feature selection methods such as odds ratio, information gain, tficf, tfidf, BNS+F1 and Chi-Squared use both positive and negative examples to assign scores to the features. A high score is assigned to a feature even if it appears evenly in only 2 or 3 categories out of 50. As the number of training documents in the top two or three categories is large, many features appear only in the top two or top three categories. Such features are assigned high scores. These features mislead the text classifier and hence many false positives are generated during the testing phase. Such features are assigned a low AM score and are filtered during the process of feature selection.

Figure 8. Comparison between AM with/without the round robin method.

Improved Gini index nullifies the effects of unbalanced classes in a dataset by combining the posterior probabilities and conditional probabilities for each term. OCFS is optimized based on the number of documents available in each class. Hence, improved Gini index and OCFS perform comparatively better than the odds ratio, information gain, tficf, tfidf, BNS+F1 and Chi-Squared algorithms. However, our results indicate that improved Gini index and OCFS perform statistically significantly worse than AM on unbalanced datasets such as OHSUMED and Genomics.

Globalizing Feature Selection Scores

As feature selection algorithms such as tficf, odds ratio, information gain, chi-squared, BNS+F1 and AM are local feature selection algorithms, we have used the traditional method (selecting the terms with the highest local scores) to convert their local scores into global feature selection scores. Additionally, similar to (Forman, 2004), we used the round robin method to convert the local feature selection scores into global scores. The round robin method selects the top n features from each category. Thus, the categories with a low number of training documents are also represented by the same number of features in the feature set.
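Round-robin globalization as described above can be sketched in a few lines (our own illustration; the data layout and the value of n are assumptions): each category contributes its top-n terms by local score, and the union forms the global feature set.

```python
def round_robin_select(local_scores, n_per_category):
    """local_scores[c] -> {term: local feature-selection score of the term for category c}.
    Returns the global feature set: the union of every category's top-n terms."""
    selected = set()
    for term_scores in local_scores.values():
        top_n = sorted(term_scores, key=term_scores.get, reverse=True)[:n_per_category]
        selected.update(top_n)   # every category is represented by the same number of features
    return selected
```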
The round robin method improves the effectiveness in identifying the documents that belong to categories with fewer training documents and leads to an improvement in macro-F1, which is the average of the F1 measures of all categories.

Figure 9. Comparison of AM with other feature selection methods in terms of F1 measure on the Reuters dataset for SVM.
Figure 10. Comparison of AM with other feature selection methods in terms of F1 measure on the 20 Newsgroups dataset for SVM.
Figure 11. Comparison of AM with other feature selection methods in terms of F1 measure on the WebKB dataset for SVM.
Figure 12. Comparison of AM with other feature selection methods in terms of F1 measure on the OHSUMED dataset for SVM.

However, the classification accuracy of the categories with a large number of training documents decreases. As we are using stratified splits for each dataset, the number of training documents that belong to a category is directly related to the number of testing documents that belong to that category. Hence, the micro-F1 measure decreases when the round robin method is used. We provide the results of AM versus AM with the round robin method in Figure 8. We observed that using the round robin method improves the macro-F1 measure by 1.7% while decreasing the micro-F1 by 5.2% for the Reuters dataset.

Tradeoff of accuracy and time with respect to AM thresholds for naïve Bayes

We now present the effects of the AM threshold on the training and testing time of naïve Bayes using the TREC 05 Genomics dataset (Figure 7). We performed similar experiments on the other datasets and observed the same trends. As the TREC 05 Genomics dataset is relatively large, the trends with respect to training and testing time are observed clearly. Hence, we only report the results for the TREC 05 Genomics dataset. The training time complexity of naïve Bayes using AM is O(N·L_d + M·V), where N is the number of documents, L_d is the average document length, M is the number of categories and V is the number of terms in the vocabulary. As N, L_d, M and V are all constant during the training phase, the training time of our algorithm is nearly constant (Figure 7). The features whose AM is above the threshold are kept. The space complexity of our naïve Bayes using AM is O(M·V). As the size of V decreases when the threshold increases, there is a slight drop in the training time. Though there is only a marginal decrease in training time during the feature selection phase, the time complexity of applying AM on the naïve Bayes classifier is linear and is faster than other commonly used algorithms such as SVM. The time complexity in the testing phase is O(M·L_v), where L_v is the number of unique terms per testing document. As we start selecting fewer features (i.e., increase the threshold), the value of M remains constant while the value of L_v decreases. This is because fewer features are available in the feature set and hence fewer unique terms are used from each document in the testing set.

Hence, as we increase the threshold, the testing time consistently decreases. It is also observed that as the threshold increases up to 0.4, the F1 measure increases while there is a reduction in testing time.

SVM using AM

In this section, we favorably compare our results of applying AM feature selection with SVM to the results using the same eight feature-selection algorithms. We varied the threshold to identify the optimal F1 measure for each feature selection method. We demonstrate how AM feature selection reduces the training time while improving the F1 measure. We also explain the effects of the AM threshold score on the classification results.

Comparison with other feature selection algorithms for the SVM classifier

SVM trains with a time complexity of O(N·L_d + M·V + M·N^c), where N is the number of documents, L_d is the average document length, M is the number of categories, V is the number of terms in the vocabulary and c is a constant (c ≈ 1.2–1.5). SVM is not a scalable algorithm by nature. We use the ModApte split for the Reuters dataset and a 9-1 split for the 20 Newsgroups and OHSUMED datasets, as given on the LibSVM dataset website. We use a standard 4-1 split for WebKB, where the data for three universities are used for training and the data for one university are used as a testing set. We use these splits as they are readily available and commonly used in the prior works (Wenqian et al., 2007; Yan et al., 2005). AM performs statistically significantly better than the eight feature selection algorithms with a confidence of 99% (Table 5).

Table 5. Statistical comparison of AM and the other feature selection algorithms with respect to F1 measure (paired t-test), over the Reuters, 20 Newsgroups, WebKB and OHSUMED datasets, for odds ratio, BNS+F1, tfidf, tficf, information gain, Chi-Squared, OCFS and improved Gini (++: AM is statistically significantly better than the feature selection algorithm with 99% confidence).

Improved Gini index is the second best performing algorithm on all four datasets. Therefore, we present the comparison of AM with improved Gini index. Our experimental results on the Reuters dataset (Figure 9) indicate that AM (F1: 89.1%) performs better than improved Gini index (F1: 88.6%) by 0.56%. For the 20 Newsgroups dataset (Figure 10), which is another dataset that contains news articles, AM (F1: 78.74%) outperforms improved Gini index (F1: 77.3%) by 1.8%. The results on the WebKB dataset (Figure 11), which contains web pages, indicate that AM (F1: 76.14%) outperforms improved Gini index (F1: 75.54%) by 0.8%. For the OHSUMED dataset (Figure 12), which contains bio-medical documents, AM (F1: 60.74%) outperforms improved Gini index (F1: 58.23%) by 4.3%.

Discussion

Our results for SVM using AM also indicate that the improvements on OHSUMED, which is a very unbalanced dataset, are larger than on the other datasets. OHSUMED has the majority of its documents in the first few (2-3) categories and fewer documents in the other categories. This improvement is achieved due to the selection of the features that point to only one category (unambiguous features). SVM classification is based on the entire set of terms in the testing document and not only on unambiguous features. Hence, the improvements observed using SVM are smaller than those observed using the naïve Bayes classifier. All features from the testing documents are used for classifying a document. LibSVM always predicts one category for each document. When a category is wrongly predicted, a false positive is generated; a false negative is also generated because the true prediction is not made. Hence, precision and recall for all the runs using LibSVM are the same.
Precision and recall vary for naïve Bayes because, when the AM threshold is high, the set of keywords is sparse and some documents do not contain any terms that are above the threshold. Such documents are predicted as uncertain, and only a false negative is generated in such cases. As we filter more features from the feature set, the number of uncertain cases increases and recall decreases.

Figure 13. Effect of feature selection on the training and testing time of SVM using AM on the OHSUMED dataset.

Tradeoff of accuracy and time with respect to AM thresholds for SVM

We now report the effect of the AM threshold on the F1 measure and the corresponding time taken to train the model and classify the documents using the SVM classifier, as depicted in Figure 13, which shows results for the OHSUMED dataset. The other datasets show the same trends. The x-axis represents different threshold values and the y-axis represents the micro-F1 measure and time. The threshold value indicates that all features whose scores are above that threshold are selected and the remaining features are filtered. As we apply AM feature selection, micro-F1 increases (Figure 13). We obtain the best micro-F1 when the threshold is set to 0.2. As the threshold is increased further, the micro-F1 starts to decrease. This indicates that when the threshold is less than 0.2, most of the features that are filtered are ambiguous, which leads to an improvement in F1 measure. When the threshold is above 0.2, most of the features that are filtered contain relevant information; thus, the F1 measure of the classifier decreases. The training time includes the feature selection time and the time taken to train the SVM model. The testing time is the time taken by LibSVM to classify the testing data. Figure 13 demonstrates that when no feature selection is used, i.e., when the threshold is equal to zero, the time taken for training on the OHSUMED dataset is 3,356 seconds. When we reduce the dimensionality of the feature set by setting the threshold to 0.2, the training time decreases to 1,623 seconds. This shows that even though the learning time is reduced by more than 50%, we still obtain a better F1 measure than when we do not apply any feature selection.

One of the limitations of using a feature selection algorithm with SVM is that a proper threshold must be found for a given dataset. We found the best threshold for the Reuters and WebKB datasets to be 0.2, and for the 20 Newsgroups and OHSUMED datasets the threshold was 0.3. To further investigate this problem, we experimented on two additional standard datasets from the Statlog collection (Michie, Spiegelhalter & Taylor, 1994): the DNA dataset (3 categories; 2,000 training documents; 1,186 testing documents) and the Vehicle dataset (4 categories; 761 training documents; 85 testing documents). Similarly, we found that a threshold between 0.2 and 0.3 yields the best results on all four datasets we used in our experiments.

Conclusion

We presented a new feature selection algorithm called ambiguity measure (AM). The underlying premise behind the AM approach is the quick identification of unambiguous terms. We define unambiguous terms as features that belong to only one category. We showed how AM is used with the naïve Bayes classifier. The most unambiguous terms (keywords) from the training documents are selected using AM and a classification model is built. Based on this model, the documents that are to be classified are scanned to identify the keywords, and the ambiguity measures (AM) of the keywords are used to calculate the probability that the document falls into a specific category. The category with the highest probability is selected as the category for that document. We empirically evaluated the performance of our methodology for using AM with the naïve Bayes classifier on five standard benchmark data sets (Reuters 21578, 20 Newsgroups, WebKB, OHSUMED and the TREC 05 Genomics collection). Our experimental results demonstrate that AM performs statistically significantly better than eight existing feature selection algorithms on the five benchmark datasets with a confidence of at least 95%.
We also applied AM as a preprocessing step for the SVM classification algorithm. We showed that AM feature selection reduces the training time of the SVM classifier while maintaining its effectiveness. Experiments were performed on four standard benchmark datasets. Our results indicated that AM performs statistically significantly better than the current published state-of-the-art feature selection algorithms on the SVM classifier. Our results also indicated that AM feature selection improved over odds ratio, information gain, Chi-Squared, BNS+F1 and tficf on unbalanced datasets like OHSUMED and Genomics, where the majority of documents belong to only 2-3 categories. Our analysis showed that selecting features that point to only one category performs better than selecting features that point to more than one category. Words that point to more than one category may mislead a single-labeled text classifier.


More information

The Performance Improvement of BASK System for Giga-Bit MODEM Using the Fuzzy System

The Performance Improvement of BASK System for Giga-Bit MODEM Using the Fuzzy System Int. J. Communcatons, Network and System Scences, 10, 3, 1-5 do:10.36/jcns.10.358 Publshed Onlne May 10 (http://www.scrp.org/journal/jcns/) The Performance Improvement of BASK System for Gga-Bt MODEM Usng

More information

Ensemble Evolution of Checkers Players with Knowledge of Opening, Middle and Endgame

Ensemble Evolution of Checkers Players with Knowledge of Opening, Middle and Endgame Ensemble Evoluton of Checkers Players wth Knowledge of Openng, Mddle and Endgame Kyung-Joong Km and Sung-Bae Cho Department of Computer Scence, Yonse Unversty 134 Shnchon-dong, Sudaemoon-ku, Seoul 120-749

More information

IEE Electronics Letters, vol 34, no 17, August 1998, pp ESTIMATING STARTING POINT OF CONDUCTION OF CMOS GATES

IEE Electronics Letters, vol 34, no 17, August 1998, pp ESTIMATING STARTING POINT OF CONDUCTION OF CMOS GATES IEE Electroncs Letters, vol 34, no 17, August 1998, pp. 1622-1624. ESTIMATING STARTING POINT OF CONDUCTION OF CMOS GATES A. Chatzgeorgou, S. Nkolads 1 and I. Tsoukalas Computer Scence Department, 1 Department

More information

{webb, cc.gatech.edu. {yorke3, ece.gatech.edu

{webb, cc.gatech.edu. {yorke3, ece.gatech.edu A Dscrmnatve Classfer Learnng Approach to Image Modelng and Identfcaton Byungk Byun, Chn-Hu Lee, 2 Steve Webb, and 2 Calton Pu School of Electrcal & Computer Engr. Georga Insttute of Technology Atlanta,

More information

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University CS345a: Data Mnng Jure Leskovec and Anand Rajaraman Stanford Unversty HW3 s out Poster sesson s on last day of classes: Thu March 11 at 4:15 Reports are due March 14 Fnal s March 18 at 12:15 Open book,

More information

Equity trend prediction with neural networks

Equity trend prediction with neural networks Res. Lett. Inf. Math. Sc., 2004, Vol. 6, pp 15-29 15 Avalable onlne at http://ms.massey.ac.nz/research/letters/ Equty trend predcton wth neural networks R.HALLIDAY Insttute of Informaton & Mathematcal

More information

Application of Linear Discriminant Analysis to Doppler Classification

Application of Linear Discriminant Analysis to Doppler Classification Applcaton of Lnear Dscrmnant Analyss to Doppler Classfcaton M. Jahangr QnetQ St Andrews Road, Malvern WORCS, UK, WR14 3PS Unted Kngdom mjahangr@qnetq.com ABSTRACT In ths wor the author demonstrated a robust

More information

Parameter Free Iterative Decoding Metrics for Non-Coherent Orthogonal Modulation

Parameter Free Iterative Decoding Metrics for Non-Coherent Orthogonal Modulation 1 Parameter Free Iteratve Decodng Metrcs for Non-Coherent Orthogonal Modulaton Albert Gullén Fàbregas and Alex Grant Abstract We study decoder metrcs suted for teratve decodng of non-coherently detected

More information

Hard Real-Time Scheduling for Low-Energy Using Stochastic Data and DVS Processors

Hard Real-Time Scheduling for Low-Energy Using Stochastic Data and DVS Processors Hard Real-me Schedulng for Low-Energy Usng Stochastc Data and DVS Processors Flavus Gruan Department of Computer Scence, Lund Unversty Box 118 S-221 00 Lund, Sweden el.: +46 046 2224673 e-mal: Flavus.Gruan@cs.lth.se

More information

Appendix E: The Effect of Phase 2 Grants

Appendix E: The Effect of Phase 2 Grants Appendx E: The Effect of Phase 2 Grants Roughly a year after recevng a $150,000 Phase 1 award, a frm may apply for a $1 mllon Phase 2 grant. Successful applcants typcally receve ther Phase 2 money nearly

More information

Estimating Mean Time to Failure in Digital Systems Using Manufacturing Defective Part Level

Estimating Mean Time to Failure in Digital Systems Using Manufacturing Defective Part Level Estmatng Mean Tme to Falure n Dgtal Systems Usng Manufacturng Defectve Part Level Jennfer Dworak, Davd Dorsey, Amy Wang, and M. Ray Mercer Texas A&M Unversty IBM Techncal Contact: Matthew W. Mehalc, PowerPC

More information

Generalized Incomplete Trojan-Type Designs with Unequal Cell Sizes

Generalized Incomplete Trojan-Type Designs with Unequal Cell Sizes Internatonal Journal of Theoretcal & Appled Scences 6(1): 50-54(2014) ISSN No. (Prnt): 0975-1718 ISSN No. (Onlne): 2249-3247 Generalzed Incomplete Trojan-Type Desgns wth Unequal Cell Szes Cn Varghese,

More information

UNIT 11 TWO-PERSON ZERO-SUM GAMES WITH SADDLE POINT

UNIT 11 TWO-PERSON ZERO-SUM GAMES WITH SADDLE POINT UNIT TWO-PERSON ZERO-SUM GAMES WITH SADDLE POINT Structure. Introducton Obectves. Key Terms Used n Game Theory.3 The Maxmn-Mnmax Prncple.4 Summary.5 Solutons/Answers. INTRODUCTION In Game Theory, the word

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webnar Seres TMIP VISION TMIP provdes techncal support and promotes knowledge and nformaton exchange n the transportaton plannng and modelng communty. DISCLAIMER The vews and opnons expressed durng ths

More information

A MODIFIED DIFFERENTIAL EVOLUTION ALGORITHM IN SPARSE LINEAR ANTENNA ARRAY SYNTHESIS

A MODIFIED DIFFERENTIAL EVOLUTION ALGORITHM IN SPARSE LINEAR ANTENNA ARRAY SYNTHESIS A MODIFIED DIFFERENTIAL EVOLUTION ALORITHM IN SPARSE LINEAR ANTENNA ARRAY SYNTHESIS Kaml Dmller Department of Electrcal-Electroncs Engneerng rne Amercan Unversty North Cyprus, Mersn TURKEY kdmller@gau.edu.tr

More information

EEG Channel Selection Using Decision Tree in Brain-Computer Interface

EEG Channel Selection Using Decision Tree in Brain-Computer Interface EEG Channel Selecton Usng Decson ree n Bran-Computer Interface Mahnaz Arvaneh * Cunta Guan Ka Keng Ang and Hok Cha Quek * * School of Computer Engneerng Nanyang echnologcal Unversty Sngapore Insttute for

More information

A NSGA-II algorithm to solve a bi-objective optimization of the redundancy allocation problem for series-parallel systems

A NSGA-II algorithm to solve a bi-objective optimization of the redundancy allocation problem for series-parallel systems 0 nd Internatonal Conference on Industral Technology and Management (ICITM 0) IPCSIT vol. 49 (0) (0) IACSIT Press, Sngapore DOI: 0.776/IPCSIT.0.V49.8 A NSGA-II algorthm to solve a b-obectve optmzaton of

More information

New Applied Methods For Optimum GPS Satellite Selection

New Applied Methods For Optimum GPS Satellite Selection New Appled Methods For Optmum GPS Satellte Selecton Hamed Azam, Student Member, IEEE Department of Electrcal Engneerng Iran Unversty of Scence &echnology ehran, Iran hamed_azam@eee.org Mlad Azarbad Department

More information

Test 2. ECON3161, Game Theory. Tuesday, November 6 th

Test 2. ECON3161, Game Theory. Tuesday, November 6 th Test 2 ECON36, Game Theory Tuesday, November 6 th Drectons: Answer each queston completely. If you cannot determne the answer, explanng how you would arrve at the answer may earn you some ponts.. (20 ponts)

More information

Chaotic Filter Bank for Computer Cryptography

Chaotic Filter Bank for Computer Cryptography Chaotc Flter Bank for Computer Cryptography Bngo Wng-uen Lng Telephone: 44 () 784894 Fax: 44 () 784893 Emal: HTwng-kuen.lng@kcl.ac.ukTH Department of Electronc Engneerng, Dvson of Engneerng, ng s College

More information

RC Filters TEP Related Topics Principle Equipment

RC Filters TEP Related Topics Principle Equipment RC Flters TEP Related Topcs Hgh-pass, low-pass, Wen-Robnson brdge, parallel-t flters, dfferentatng network, ntegratng network, step response, square wave, transfer functon. Prncple Resstor-Capactor (RC)

More information

Digital Transmission

Digital Transmission Dgtal Transmsson Most modern communcaton systems are dgtal, meanng that the transmtted normaton sgnal carres bts and symbols rather than an analog sgnal. The eect o C/N rato ncrease or decrease on dgtal

More information

FEATURE SELECTION FOR SMALL-SIGNAL STABILITY ASSESSMENT

FEATURE SELECTION FOR SMALL-SIGNAL STABILITY ASSESSMENT FEAURE SELECION FOR SMALL-SIGNAL SABILIY ASSESSMEN S.P. eeuwsen Unversty of Dusburg teeuwsen@un-dusburg.de Abstract INRODUCION hs paper ntroduces dfferent feature selecton technques for neural network

More information

Exploiting Dynamic Workload Variation in Low Energy Preemptive Task Scheduling

Exploiting Dynamic Workload Variation in Low Energy Preemptive Task Scheduling Explotng Dynamc Worload Varaton n Low Energy Preemptve Tas Schedulng Lap-Fa Leung, Ch-Yng Tsu Department of Electrcal and Electronc Engneerng Hong Kong Unversty of Scence and Technology Clear Water Bay,

More information

Optimal Placement of PMU and RTU by Hybrid Genetic Algorithm and Simulated Annealing for Multiarea Power System State Estimation

Optimal Placement of PMU and RTU by Hybrid Genetic Algorithm and Simulated Annealing for Multiarea Power System State Estimation T. Kerdchuen and W. Ongsakul / GMSARN Internatonal Journal (09) - Optmal Placement of and by Hybrd Genetc Algorthm and Smulated Annealng for Multarea Power System State Estmaton Thawatch Kerdchuen and

More information

Networks. Backpropagation. Backpropagation. Introduction to. Backpropagation Network training. Backpropagation Learning Details 1.04.

Networks. Backpropagation. Backpropagation. Introduction to. Backpropagation Network training. Backpropagation Learning Details 1.04. Networs Introducton to - In 1986 a method for learnng n mult-layer wor,, was nvented by Rumelhart Paper Why are what and where processed by separate cortcal vsual systems? - The algorthm s a sensble approach

More information

Control Chart. Control Chart - history. Process in control. Developed in 1920 s. By Dr. Walter A. Shewhart

Control Chart. Control Chart - history. Process in control. Developed in 1920 s. By Dr. Walter A. Shewhart Control Chart - hstory Control Chart Developed n 920 s By Dr. Walter A. Shewhart 2 Process n control A phenomenon s sad to be controlled when, through the use of past experence, we can predct, at least

More information

Resource Allocation Optimization for Device-to- Device Communication Underlaying Cellular Networks

Resource Allocation Optimization for Device-to- Device Communication Underlaying Cellular Networks Resource Allocaton Optmzaton for Devce-to- Devce Communcaton Underlayng Cellular Networks Bn Wang, L Chen, Xaohang Chen, Xn Zhang, and Dacheng Yang Wreless Theores and Technologes (WT&T) Bejng Unversty

More information

Performance Analysis of the Weighted Window CFAR Algorithms

Performance Analysis of the Weighted Window CFAR Algorithms Performance Analyss of the Weghted Wndow CFAR Algorthms eng Xangwe Guan Jan He You Department of Electronc Engneerng, Naval Aeronautcal Engneerng Academy, Er a road 88, Yanta Cty 6400, Shandong Provnce,

More information

A High-Sensitivity Oversampling Digital Signal Detection Technique for CMOS Image Sensors Using Non-destructive Intermediate High-Speed Readout Mode

A High-Sensitivity Oversampling Digital Signal Detection Technique for CMOS Image Sensors Using Non-destructive Intermediate High-Speed Readout Mode A Hgh-Senstvty Oversamplng Dgtal Sgnal Detecton Technque for CMOS Image Sensors Usng Non-destructve Intermedate Hgh-Speed Readout Mode Shoj Kawahto*, Nobuhro Kawa** and Yoshak Tadokoro** *Research Insttute

More information

Safety and resilience of Global Baltic Network of Critical Infrastructure Networks related to cascading effects

Safety and resilience of Global Baltic Network of Critical Infrastructure Networks related to cascading effects Blokus-Roszkowska Agneszka Dzula Przemysław Journal of Polsh afety and Relablty Assocaton ummer afety and Relablty emnars, Volume 9, Number, Kołowrock Krzysztof Gdyna Martme Unversty, Gdyna, Poland afety

More information

NETWORK 2001 Transportation Planning Under Multiple Objectives

NETWORK 2001 Transportation Planning Under Multiple Objectives NETWORK 200 Transportaton Plannng Under Multple Objectves Woodam Chung Graduate Research Assstant, Department of Forest Engneerng, Oregon State Unversty, Corvalls, OR9733, Tel: (54) 737-4952, Fax: (54)

More information

Distributed Fault Detection of Wireless Sensor Networks

Distributed Fault Detection of Wireless Sensor Networks Dstrbuted Fault Detecton of Wreless Sensor Networs Jnran Chen, Shubha Kher, and Arun Soman Dependable Computng and Networng Lab Iowa State Unversty Ames, Iowa 50010 {jrchen, shubha, arun}@astate.edu ABSTRACT

More information

Priority based Dynamic Multiple Robot Path Planning

Priority based Dynamic Multiple Robot Path Planning 2nd Internatonal Conference on Autonomous obots and Agents Prorty based Dynamc Multple obot Path Plannng Abstract Taxong Zheng Department of Automaton Chongqng Unversty of Post and Telecommuncaton, Chna

More information

A Novel Optimization of the Distance Source Routing (DSR) Protocol for the Mobile Ad Hoc Networks (MANET)

A Novel Optimization of the Distance Source Routing (DSR) Protocol for the Mobile Ad Hoc Networks (MANET) A Novel Optmzaton of the Dstance Source Routng (DSR) Protocol for the Moble Ad Hoc Networs (MANET) Syed S. Rzv 1, Majd A. Jafr, and Khaled Ellethy Computer Scence and Engneerng Department Unversty of Brdgeport

More information

A MODIFIED DIRECTIONAL FREQUENCY REUSE PLAN BASED ON CHANNEL ALTERNATION AND ROTATION

A MODIFIED DIRECTIONAL FREQUENCY REUSE PLAN BASED ON CHANNEL ALTERNATION AND ROTATION A MODIFIED DIRECTIONAL FREQUENCY REUSE PLAN BASED ON CHANNEL ALTERNATION AND ROTATION Vncent A. Nguyen Peng-Jun Wan Ophr Freder Computer Scence Department Illnos Insttute of Technology Chcago, Illnos vnguyen@t.edu,

More information

Space Time Equalization-space time codes System Model for STCM

Space Time Equalization-space time codes System Model for STCM Space Tme Eualzaton-space tme codes System Model for STCM The system under consderaton conssts of ST encoder, fadng channel model wth AWGN, two transmt antennas, one receve antenna, Vterb eualzer wth deal

More information

Inverse Halftoning Method Using Pattern Substitution Based Data Hiding Scheme

Inverse Halftoning Method Using Pattern Substitution Based Data Hiding Scheme Proceedngs of the World Congress on Engneerng 2011 Vol II, July 6-8, 2011, London, U.K. Inverse Halftonng Method Usng Pattern Substtuton Based Data Hdng Scheme Me-Y Wu, Ja-Hong Lee and Hong-Je Wu Abstract

More information

A Simple Satellite Exclusion Algorithm for Advanced RAIM

A Simple Satellite Exclusion Algorithm for Advanced RAIM A Smple Satellte Excluson Algorthm for Advanced RAIM Juan Blanch, Todd Walter, Per Enge Stanford Unversty ABSTRACT Advanced Recever Autonomous Integrty Montorng s a concept that extends RAIM to mult-constellaton

More information

Document Comparison with a Weighted Topic Hierarchy

Document Comparison with a Weighted Topic Hierarchy Document Comparson wth a Weghted Topc Herarchy A. Gelbukh, G. Sdorov, and A. Guzmán-Arenas Natural Language Laboratory, Center for Computng Research (CIC), Natonal Polytechnc Insttute (IPN), Mexco Cty

More information

Comparison of Two Measurement Devices I. Fundamental Ideas.

Comparison of Two Measurement Devices I. Fundamental Ideas. Comparson of Two Measurement Devces I. Fundamental Ideas. ASQ-RS Qualty Conference March 16, 005 Joseph G. Voelkel, COE, RIT Bruce Sskowsk Rechert, Inc. Topcs The Problem, Eample, Mathematcal Model One

More information

Fast Code Detection Using High Speed Time Delay Neural Networks

Fast Code Detection Using High Speed Time Delay Neural Networks Fast Code Detecton Usng Hgh Speed Tme Delay Neural Networks Hazem M. El-Bakry 1 and Nkos Mastoraks 1 Faculty of Computer Scence & Informaton Systems, Mansoura Unversty, Egypt helbakry0@yahoo.com Department

More information

MULTIPLE LAYAR KERNEL-BASED APPROACH IN RELEVANCE FEEDBACK CONTENT-BASED IMAGE RETRIEVAL SYSTEM

MULTIPLE LAYAR KERNEL-BASED APPROACH IN RELEVANCE FEEDBACK CONTENT-BASED IMAGE RETRIEVAL SYSTEM Proceedngs of the Fourth Internatonal Conference on Machne Learnng and Cybernetcs, Guangzhou, 18-21 August 2005 MULTIPLE LAYAR KERNEL-BASED APPROACH IN RELEVANCE FEEDBACK CONTENT-BASED IMAGE RETRIEVAL

More information

ROBUST IDENTIFICATION AND PREDICTION USING WILCOXON NORM AND PARTICLE SWARM OPTIMIZATION

ROBUST IDENTIFICATION AND PREDICTION USING WILCOXON NORM AND PARTICLE SWARM OPTIMIZATION 7th European Sgnal Processng Conference (EUSIPCO 9 Glasgow, Scotland, August 4-8, 9 ROBUST IDENTIFICATION AND PREDICTION USING WILCOXON NORM AND PARTICLE SWARM OPTIMIZATION Babta Majh, G. Panda and B.

More information

The Impact of Spectrum Sensing Frequency and Packet- Loading Scheme on Multimedia Transmission over Cognitive Radio Networks

The Impact of Spectrum Sensing Frequency and Packet- Loading Scheme on Multimedia Transmission over Cognitive Radio Networks Ths artcle has been accepted for publcaton n a future ssue of ths journal, but has not been fully edted. Content may change pror to fnal publcaton. The Impact of Spectrum Sensng Frequency and Pacet- Loadng

More information

STATISTICS. is given by. i i. = total frequency, d i. = x i a ANIL TUTORIALS. = total frequency and d i. = total frequency, h = class-size

STATISTICS. is given by. i i. = total frequency, d i. = x i a ANIL TUTORIALS. = total frequency and d i. = total frequency, h = class-size STATISTICS ImPORTANT TERmS, DEFINITIONS AND RESULTS l The mean x of n values x 1, x 2, x 3,... x n s gven by x1+ x2 + x3 +... + xn x = n l mean of grouped data (wthout class-ntervals) () Drect method :

More information

A study of turbo codes for multilevel modulations in Gaussian and mobile channels

A study of turbo codes for multilevel modulations in Gaussian and mobile channels A study of turbo codes for multlevel modulatons n Gaussan and moble channels Lamne Sylla and Paul Forter (sylla, forter)@gel.ulaval.ca Department of Electrcal and Computer Engneerng Laval Unversty, Ste-Foy,

More information

[Type text] [Type text] [Type text] Wenjing Yuan Luxun Art Academy of Yan an University Xi an, , (CHINA)

[Type text] [Type text] [Type text] Wenjing Yuan Luxun Art Academy of Yan an University Xi an, , (CHINA) [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 19 BoTechnology 2014 An Indan Journal FULL PAPER BTAIJ, 10(19, 2014 [10873-10877] Computer smulaton analyss on pano tmbre ABSTRACT Wenjng

More information

Research Article Indoor Localisation Based on GSM Signals: Multistorey Building Study

Research Article Indoor Localisation Based on GSM Signals: Multistorey Building Study Moble Informaton Systems Volume 26, Artcle ID 279576, 7 pages http://dx.do.org/.55/26/279576 Research Artcle Indoor Localsaton Based on GSM Sgnals: Multstorey Buldng Study RafaB Górak, Marcn Luckner, MchaB

More information

A Novel UWB Imaging System Setup for Computer- Aided Breast Cancer Diagnosis

A Novel UWB Imaging System Setup for Computer- Aided Breast Cancer Diagnosis A Novel UWB Imagng System Setup for Computer- Aded Breast Cancer Dagnoss Xang He, Ja L, Chenxng Wu Electrcal and Computer Engneerng Oakland Unversty, OU Rochester, I 48309, U.S.A xhe2@oakland.edu, l4@oakland.edu,

More information

Passive Filters. References: Barbow (pp ), Hayes & Horowitz (pp 32-60), Rizzoni (Chap. 6)

Passive Filters. References: Barbow (pp ), Hayes & Horowitz (pp 32-60), Rizzoni (Chap. 6) Passve Flters eferences: Barbow (pp 6575), Hayes & Horowtz (pp 360), zzon (Chap. 6) Frequencyselectve or flter crcuts pass to the output only those nput sgnals that are n a desred range of frequences (called

More information

THEORY OF YARN STRUCTURE by Prof. Bohuslav Neckář, Textile Department, IIT Delhi, New Delhi. Compression of fibrous assemblies

THEORY OF YARN STRUCTURE by Prof. Bohuslav Neckář, Textile Department, IIT Delhi, New Delhi. Compression of fibrous assemblies THEORY OF YARN STRUCTURE by Prof. Bohuslav Neckář, Textle Department, IIT Delh, New Delh. Compresson of fbrous assembles Q1) What was the dea of fbre-to-fbre contact accordng to van Wyk? A1) Accordng to

More information

Guidelines for CCPR and RMO Bilateral Key Comparisons CCPR Working Group on Key Comparison CCPR-G5 October 10 th, 2014

Guidelines for CCPR and RMO Bilateral Key Comparisons CCPR Working Group on Key Comparison CCPR-G5 October 10 th, 2014 Gudelnes for CCPR and RMO Blateral Key Comparsons CCPR Workng Group on Key Comparson CCPR-G5 October 10 th, 2014 These gudelnes are prepared by CCPR WG-KC and RMO P&R representatves, and approved by CCPR,

More information

Classification of Satellite Images by Texture-Based Models Modulation Using MLP, SVM Neural Networks and Nero Fuzzy

Classification of Satellite Images by Texture-Based Models Modulation Using MLP, SVM Neural Networks and Nero Fuzzy Internatonal Journal of Electroncs and Electrcal Engneerng Vol. 1, No. 4, December, 2013 Classfcaton of Satellte Images by Texture-Based Models Modulaton Usng MLP, SVM Neural Networks and Nero Fuzzy Gholam

More information

Time-frequency Analysis Based State Diagnosis of Transformers Windings under the Short-Circuit Shock

Time-frequency Analysis Based State Diagnosis of Transformers Windings under the Short-Circuit Shock Tme-frequency Analyss Based State Dagnoss of Transformers Wndngs under the Short-Crcut Shock YUYING SHAO, ZHUSHI RAO School of Mechancal Engneerng ZHIJIAN JIN Hgh Voltage Lab Shangha Jao Tong Unversty

More information

A Patent Quality Classification System Using a Kernel-PCA with SVM

A Patent Quality Classification System Using a Kernel-PCA with SVM ADVCOMP 05 : The nth Internatonal Conference on Advanced Engneerng Computng and Applcatons n Scences A Patent Qualty Classfcaton System Usng a Kernel-PCA wth SVM Pe-Chann Chang Innovaton Center for Bg

More information

A New Type of Weighted DV-Hop Algorithm Based on Correction Factor in WSNs

A New Type of Weighted DV-Hop Algorithm Based on Correction Factor in WSNs Journal of Communcatons Vol. 9, No. 9, September 2014 A New Type of Weghted DV-Hop Algorthm Based on Correcton Factor n WSNs Yng Wang, Zhy Fang, and Ln Chen Department of Computer scence and technology,

More information

Secure Transmission of Sensitive data using multiple channels

Secure Transmission of Sensitive data using multiple channels Secure Transmsson of Senstve data usng multple channels Ahmed A. Belal, Ph.D. Department of computer scence and automatc control Faculty of Engneerng Unversty of Alexandra Alexandra, Egypt. aabelal@hotmal.com

More information

Ensemble of Distributed Learners for Online Classification of Dynamic Data Streams

Ensemble of Distributed Learners for Online Classification of Dynamic Data Streams 1 Ensemble of Dstrbuted Learners for Onlne Classfcaton of Dynamc Data Streams Luca Canzan, Member, IEEE, Yu Zhang, and Mhaela van der Schaar, Fellow, IEEE arxv:138.5281v1 [cs.lg] 24 Aug 213 Abstract We

More information

Discussion on How to Express a Regional GPS Solution in the ITRF

Discussion on How to Express a Regional GPS Solution in the ITRF 162 Dscusson on How to Express a Regonal GPS Soluton n the ITRF Z. ALTAMIMI 1 Abstract The usefulness of the densfcaton of the Internatonal Terrestral Reference Frame (ITRF) s to facltate ts access as

More information

Low Switching Frequency Active Harmonic Elimination in Multilevel Converters with Unequal DC Voltages

Low Switching Frequency Active Harmonic Elimination in Multilevel Converters with Unequal DC Voltages Low Swtchng Frequency Actve Harmonc Elmnaton n Multlevel Converters wth Unequal DC Voltages Zhong Du,, Leon M. Tolbert, John N. Chasson, Hu L The Unversty of Tennessee Electrcal and Computer Engneerng

More information

Optimizing a System of Threshold-based Sensors with Application to Biosurveillance

Optimizing a System of Threshold-based Sensors with Application to Biosurveillance Optmzng a System of Threshold-based Sensors wth Applcaton to Bosurvellance Ronald D. Frcker, Jr. Thrd Annual Quanttatve Methods n Defense and Natonal Securty Conference May 28, 2008 What s Bosurvellance?

More information

Analysis of Time Delays in Synchronous and. Asynchronous Control Loops. Bj rn Wittenmark, Ben Bastian, and Johan Nilsson

Analysis of Time Delays in Synchronous and. Asynchronous Control Loops. Bj rn Wittenmark, Ben Bastian, and Johan Nilsson 37th CDC, Tampa, December 1998 Analyss of Delays n Synchronous and Asynchronous Control Loops Bj rn Wttenmark, Ben Bastan, and Johan Nlsson emal: bjorn@control.lth.se, ben@control.lth.se, and johan@control.lth.se

More information

Tile Values of Information in Some Nonzero Sum Games

Tile Values of Information in Some Nonzero Sum Games lnt. ournal of Game Theory, Vot. 6, ssue 4, page 221-229. Physca- Verlag, Venna. Tle Values of Informaton n Some Nonzero Sum Games By P. Levne, Pars I ), and ZP, Ponssard, Pars 2 ) Abstract: The paper

More information

AIR FORCE INSTITUTE OF TECHNOLOGY

AIR FORCE INSTITUTE OF TECHNOLOGY MULTIPLE ACCESS INTERFERENCE CHARACTERIZATION FOR DIRECT-SEQUENCE SPREAD-SPECTRUM COMMUNICATIONS USING CHIP WAVEFORM SHAPING THESIS Matthew G. Glen, Captan, USAF AFIT/GE/ENG/04-10 DEPARTMENT OF THE AIR

More information

Figure.1. Basic model of an impedance source converter JCHPS Special Issue 12: August Page 13

Figure.1. Basic model of an impedance source converter JCHPS Special Issue 12: August Page 13 A Hgh Gan DC - DC Converter wth Soft Swtchng and Power actor Correcton for Renewable Energy Applcaton T. Selvakumaran* and. Svachdambaranathan Department of EEE, Sathyabama Unversty, Chenna, Inda. *Correspondng

More information

Review: Our Approach 2. CSC310 Information Theory

Review: Our Approach 2. CSC310 Information Theory CSC30 Informaton Theory Sam Rowes Lecture 3: Provng the Kraft-McMllan Inequaltes September 8, 6 Revew: Our Approach The study of both compresson and transmsson requres that we abstract data and messages

More information

Optimal Sizing and Allocation of Residential Photovoltaic Panels in a Distribution Network for Ancillary Services Application

Optimal Sizing and Allocation of Residential Photovoltaic Panels in a Distribution Network for Ancillary Services Application Optmal Szng and Allocaton of Resdental Photovoltac Panels n a Dstrbuton Networ for Ancllary Servces Applcaton Reza Ahmad Kordhel, Student Member, IEEE, S. Al Pourmousav, Student Member, IEEE, Jayarshnan

More information

Artificial Intelligence Techniques Applications for Power Disturbances Classification

Artificial Intelligence Techniques Applications for Power Disturbances Classification Internatonal Journal of Electrcal and Computer Engneerng 3:5 28 Artfcal Intellgence Technques Applcatons for Power Dsturbances Classfcaton K.Manmala, Dr.K.Selv and R.Ahla Abstract Artfcal Intellgence (AI)

More information

Latency Insertion Method (LIM) for IR Drop Analysis in Power Grid

Latency Insertion Method (LIM) for IR Drop Analysis in Power Grid Abstract Latency Inserton Method (LIM) for IR Drop Analyss n Power Grd Dmtr Klokotov, and José Schutt-Ané Wth the steadly growng number of transstors on a chp, and constantly tghtenng voltage budgets,

More information

Evaluate the Effective of Annular Aperture on the OTF for Fractal Optical Modulator

Evaluate the Effective of Annular Aperture on the OTF for Fractal Optical Modulator Global Advanced Research Journal of Management and Busness Studes (ISSN: 2315-5086) Vol. 4(3) pp. 082-086, March, 2015 Avalable onlne http://garj.org/garjmbs/ndex.htm Copyrght 2015 Global Advanced Research

More information

Performance Testing of the Rockwell PLGR+ 96 P/Y Code GPS receiver

Performance Testing of the Rockwell PLGR+ 96 P/Y Code GPS receiver Performance Testng of the Rockwell PLGR+ 96 P/Y Code GPS recever By Santago Mancebo and Ken Chamberlan Introducton: The Rockwell PLGR (Precson Lghtweght GPS Recever) + 96 s a Precse Postonng Servce P/Y

More information

TECHNICAL NOTE TERMINATION FOR POINT- TO-POINT SYSTEMS TN TERMINATON FOR POINT-TO-POINT SYSTEMS. Zo = L C. ω - angular frequency = 2πf

TECHNICAL NOTE TERMINATION FOR POINT- TO-POINT SYSTEMS TN TERMINATON FOR POINT-TO-POINT SYSTEMS. Zo = L C. ω - angular frequency = 2πf TECHNICAL NOTE TERMINATION FOR POINT- TO-POINT SYSTEMS INTRODUCTION Because dgtal sgnal rates n computng systems are ncreasng at an astonshng rate, sgnal ntegrty ssues have become far more mportant to

More information

Graph Method for Solving Switched Capacitors Circuits

Graph Method for Solving Switched Capacitors Circuits Recent Advances n rcuts, ystems, gnal and Telecommuncatons Graph Method for olvng wtched apactors rcuts BHUMIL BRTNÍ Department of lectroncs and Informatcs ollege of Polytechncs Jhlava Tolstého 6, 586

More information

Cod and climate: effect of the North Atlantic Oscillation on recruitment in the North Atlantic

Cod and climate: effect of the North Atlantic Oscillation on recruitment in the North Atlantic Ths appendx accompanes the artcle Cod and clmate: effect of the North Atlantc Oscllaton on recrutment n the North Atlantc Lef Chrstan Stge 1, Ger Ottersen 2,3, Keth Brander 3, Kung-Sk Chan 4, Nls Chr.

More information

1 GSW Multipath Channel Models

1 GSW Multipath Channel Models In the general case, the moble rado channel s pretty unpleasant: there are a lot of echoes dstortng the receved sgnal, and the mpulse response keeps changng. Fortunately, there are some smplfyng assumptons

More information

Evaluation of short-time speech-based intelligibility metrics

Evaluation of short-time speech-based intelligibility metrics Communcaton: 9th Internatonal Congress on Nose as a Publc Health Problem (ICBEN) 8, Foxwoods, CT Evaluaton of short-tme speech-based ntellgblty metrcs Karen L Payton*, Mona Shrestha Unversty of Massachusetts

More information