Finding Person X: Correlating Names with Visual Appearances

Jun Yang, Ming-yu Chen, and Alex Hauptmann
School of Computer Science, Carnegie Mellon University
Pittsburgh, PA 15213, USA
{juny, mychen, alex}@cs.cmu.edu
http://www.informedia.cs.cmu.edu

Abstract. People as news subjects carry rich semantics in broadcast news video, and therefore finding a named person in the video is a major challenge for video retrieval. This task can be achieved by exploiting the multi-modal information in videos, including transcript, video structure, and visual features. We propose a comprehensive approach for finding specific persons in broadcast news videos by exploring various clues such as names occurring in the transcript, face information, anchor scenes, and, most importantly, the timing pattern between names and people. Experiments on the TRECVID 2003 dataset show that our approach achieves high performance.

1 Introduction

The dramatic increase of digital videos demands more efficient and accurate access to video content. Content-based analysis and retrieval has been extensively used for video segmentation [2], video retrieval [3], and image retrieval [1]. As discussed in [4], finding a specific person in videos is essential to understanding and retrieving videos. Although solving this problem might be difficult for general videos, in this paper we target a very specific type of content, namely broadcast news video. Since news videos are strongly related to human subjects, finding "person X" is an important and frequent challenge.

Taking advantage of the multimodal content in videos, we propose a people-finding approach which exploits name occurrence in the transcript, video structure, and visual information such as faces and news anchor scenes. Specifically, this approach utilizes a timing model to overcome the temporal offset between names and persons, which would otherwise compromise performance. Our approach was developed and evaluated using the dataset from the TREC 2003 Video Track (TRECVID) [5], which is divided into a training set (FSD) and a testing set (FST), each consisting of over 100 hours of ABC, CNN, and C-SPAN news video.
2 Transcript search with timing-based score propagation

An essential clue for finding a person in broadcast news video is the mention of his/her name in the transcript, acquired either from a speech recognizer or from closed captions. This clue indicates that the person is likely to appear visually. We do not address the rare cases where a person appears without his/her name being mentioned. In this section, we discuss using the transcript to find and rank video shots that contain specific persons. Here a video shot is defined as an unbroken sequence of frames taken by one camera, and it serves as the basic structural unit in our video retrieval.

2.1 Basic transcript-based search

Since the transcript is temporally aligned with the video, each shot is associated with the portion of the transcript that falls within its boundary. Therefore, an intuitive way of finding a specific person in video is to use text-based retrieval techniques to find the shots whose transcript contains the name. Specifically, we employ the TF-IDF retrieval method [6], which gives the similarity between a shot S and a person named X as:

$$R(X, S) = \sum_{t \in X} tf_t \, \log \frac{N}{n_t} \qquad (1)$$

where tf_t is the frequency of term t (as a part of the name X) in the transcript of shot S, N is the total number of shots, and n_t is the number of shots whose transcript contains t.

2.2 Modeling timing between names and persons

The method above is subject to a severe problem: a person does not necessarily appear in the video at the moment his/her name is mentioned in the transcript. Based on the statistics we have collected, in more than half of the cases a person does not show up in the shot where the name is mentioned, but in a shot before or after it. This mismatch seriously compromises the performance of text-based shot retrieval, which explores only the shots containing the person's name.

The timing between visual appearances (i.e., faces) and occurrences of a name is related to the "video grammar" of broadcast news. In a typical news story, an anchorperson briefs the news at the beginning, followed by several shots showing the news event and sometimes interviews and reporters. The name of a human subject in the news is normally first mentioned by the anchorperson, while his/her face is not always shown at that time.
In the following shots, this person may appear several times in the video, roughly interleaved with occurrences of the name in the transcript. However, there are also cases where a person not mentioned by the anchorperson later appears in the shots, with or without his/her name mentioned in close proximity. Generally, no simple pattern captures all such timing possibilities, but it is still true that a person is more likely to appear in the temporal proximity of where his/her name is mentioned. Loosely speaking, the closer a shot is to a name occurrence, the more likely it is to contain the person's visual appearance.

As an example, we collected all the visual appearances of "Bill Gates" in FSD, and plot in Fig. 1 the frequency of these appearances at each quantized distance from the closest occurrence of his name. The distance is measured in terms of time or shot offset (the number of shots in between). The "0" point on the distance axis is where the name is mentioned, and a positive distance means that the person appears visually after the name is mentioned.
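The offsets plotted in Fig. 1 can be computed along the following lines. This is a minimal sketch rather than the procedure actually used: the function names, the use of timestamps in seconds, and the bin width are illustrative assumptions, and at least one name mention is assumed to exist.

```python
from bisect import bisect_left
from collections import Counter


def signed_offset(appearance, sorted_mentions):
    """Offset from the closest name mention to a visual appearance.

    Positive means the person appears after the name is mentioned,
    matching the sign convention used in the text.
    """
    i = bisect_left(sorted_mentions, appearance)
    # Only the mentions just before and just after can be the closest.
    candidates = sorted_mentions[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda m: abs(appearance - m))
    return appearance - nearest


def offset_histogram(appearances, mentions, bin_width):
    """Quantize offsets into bins of `bin_width` and count each bin."""
    sorted_mentions = sorted(mentions)
    offsets = [signed_offset(a, sorted_mentions) for a in appearances]
    counts = Counter(round(o / bin_width) * bin_width for o in offsets)
    return offsets, dict(sorted(counts.items()))
```

With hypothetical mentions at 10 s and 50 s and appearances at 8 s, 12 s and 60 s, the offsets are -2, 2 and 10 seconds. The same code applies to shot offsets if shot indices are passed instead of timestamps.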

(a) Distance in seconds (b) Distance in shot offset

Fig. 1. The frequency of Bill Gates' visual appearances in association with name occurrences, and the Gaussian curves capturing the frequency distribution.

Based on Fig. 1, it is intuitive to model the frequency of a person's visual appearance w.r.t. his name occurrence using a Gaussian model. For a specific person, we estimate a Gaussian distribution by maximum likelihood estimation from the distances between each of his visual appearances in FSD and the closest name occurrence, both of which are manually labeled. Again, the distance is measured in terms of time or shot offset. In Fig. 1, we superimpose the curves of the estimated Gaussian distributions for "Bill Gates", which nicely capture the shape of the bins showing the frequencies.

In total, 20 persons are selected for study, varying from frequently appearing ones like "Michael Jordan" to rare ones like "Alan Greenspan". Table 1 shows the number of visual appearances of each person in FSD and FST respectively. The mean and standard deviation of the Gaussian distribution of each person estimated on FSD are plotted in Fig. 2(a) for time-based distance and in Fig. 2(b) for shot-based distance. People are ordered from left to right in descending frequency of their visual appearance in FSD. A global distribution computed from the pooled training data of all the people is shown alongside.

Table 1. The 20 people studied and the number of their visual appearances in FSD and FST.

Name: Lewinsky, Jordan, Yeltsin, Starr, Albright, Ginsburg, Pope, McCartney, Gates, Diana
FSD/FST: 53 / 44 47 / 75 40 / 10 37 / 35 30 / 40 8 / 9 / 45 6 / 10 / 19 1 / 7
Name: Malone, Netanyahu, Kendall, Hillary, Arafat, Kohl, Greenspan, Suharto, Jiang, Laden
FSD/FST: 11 / 19 7 / 4 6 / 3 6 / 1 3 / 33 3 / 6 / 6 / 0 / 19 0 / 6

As shown in Fig. 2(a), for the first 9 people on the left, each of whom appears 20+ times in FSD, the estimated distributions have similar mean values (1-3 sec.) and moderate standard deviations (3-6 sec.).
This suggests that the Gaussian assumption is reasonably good for these people, and that their distributions are similar to each other. Therefore, on average a person appears about 2 seconds after his name is mentioned in the "grammar" of news video. For the people with fewer than 20 appearances in FSD, however, the estimated distributions differ significantly: the mean varies from -2 to 14 seconds, and the standard deviation can be as large as 12 seconds. But it is not fair to say that each infrequent name has a unique distribution, since our observation is biased by the insufficient training data in FSD used to estimate their distributions. We will explore this question further in our experiments. The same trend is observed in the shot-based distributions in Fig. 2(b).

(a) Time-based distance, in seconds (b) Shot-based distance, in number of shots

Fig. 2. The mean and standard deviation of the Gaussian distributions for each person.

2.3 Search methods with score propagation

Given the timing information, the basic transcript-based search can evidently be improved by propagating the similarity scores from the shots containing the intended person's name to the neighboring shots within a window. The propagation is carried out as:

$$R_p(X, S) = \sum_{|S - S'| < w} f(S, S') \, R(X, S') \qquad (2)$$

where w is the size of the window measured either by time or by shot offset, and f(S, S') is a weighting function with output within (0, 1), which decides the score being propagated to neighboring shots. The summation traverses all the shots S' that are in the neighborhood of S and have the intended name in the transcript. The weighting function f(S, S') can take many forms, depending on the design decisions made along the following dimensions:

Flat window or weighted (Gaussian) window: In a flat window, f(S, S') is a constant and all the shots in the window are propagated with the same score. In a weighted window, however, the score propagated to each shot is determined by its probability of containing the person's visual appearance, which is calculated from the density function of a Gaussian distribution. In this case, f(S, S') is

$$f(S, S') = \int_{\mathit{start}}^{\mathit{end}} N(\mu, \sigma) \, dx \qquad (3)$$

where start and end are the starting and ending positions of S in relation to S' (which has the intended name), and N(μ, σ) is the density function of the Gaussian distribution.

Time-based or shot-based distance measure: This decides whether to use a time-based Gaussian model N^t_X(μ, σ) or a shot-based one N^s_X(μ, σ). This makes a difference since shot lengths differ a lot, and it is unclear which measure is more desirable for revealing the relationship between a person's visual appearance and the name occurrence.

Local, global or combined Gaussian distribution: To search for a person, we can use the local Gaussian distribution trained particularly for this person, N_X(μ, σ), the global distribution trained on all the people, N_G(μ, σ), or a combination of them, N_C(μ, σ). Intuitively, if each person has a unique distribution and there is enough training data, the local (people-specific) model is more desirable; otherwise the global one is better. The combined model uses a distribution integrated from both the local distribution and the global one. Inspired by the smoothing techniques used to overcome the sparse training data problem in information retrieval [8], this model "smooths" a person's local distribution estimated from insufficient data with the global distribution. Specifically, the probability density function of the combined distribution is a linear combination of those of the local and the global distributions, where the weight is determined by the amount of training data associated with the person.
It is formulated as:

$$N_C = \alpha N_X + (1 - \alpha) N_G \quad \text{and} \quad \alpha = \mathrm{sigmoid}\left(\frac{T_X}{\beta} - \gamma\right) \qquad (4)$$

where α is the weight computed from the number of training data T_X for person X, and β and γ are constants, which are set to 10 and 1 as determined by our informal experiments. According to the property of the sigmoid function, α approaches 1 as T_X increases, and decreases as T_X decreases (e.g., α = 0.5 when T_X = 10, and α = 0.88 when T_X = 30). Therefore, the more training data we have observed, the more the combined distribution is determined by the local distribution.

3 Face searching and anchor filtering

Visual information provides valuable clues for finding a person in news video. Unlike text information, which only roughly estimates where a person is, visual information can tell the exact position and time of the person's appearance. Face recognition technology can match a person's face visually and predict its identity, though its performance is significantly affected by pose and illumination variations. Another important visual clue comes from anchor detection, since people as news subjects seldom occur during the anchor shot.

We apply the well-known Eigenface algorithm [9] for face recognition. Faces are collected using a face detection system [10], converted to gray levels and normalized to a standard size. Principal component analysis (PCA) is performed to construct Eigenfaces, which encode the most distinguishing parts of faces while ignoring similar parts. The Eigenface representation has been shown to be a fairly robust approach to face recognition. However, it also has several drawbacks, the most serious of which is its sensitivity to pose variations, as non-frontal faces usually yield much poorer recognition results than frontal ones. Lighting conditions present another serious problem. In broadcast news, due to the large variations in news footage, both the pose and lighting conditions of faces vary widely, resulting in unreliable face recognition.

To avoid the face recognition difficulties, we first use the trustworthy text information to find some shots as initial results, and apply face recognition to them to obtain additional clues for refining the initial results. In this way, the number of faces to be recognized is largely reduced and the accuracy can be improved. To address the wide variance in pose and lighting conditions, we find external images that contain the target face under varied conditions and use them as examples to recognize relevant faces. The (internal) faces to be recognized are extracted from the i-frames of the shots to be examined. Let the external Eigenfaces be denoted as {F_1, F_2, F_3, ..., F_n} and the internal Eigenfaces be denoted as {f_1, f_2, f_3, ..., f_m}. By matching every internal face with a specific external face F_j based on Eigenface, we obtain a ranking of all internal faces ordered by descending similarity to F_j.
The final rank of an internal face is combined from its ranks with all the external faces, given as:

$$R(f_i) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{R_j(f_i)} \qquad (5)$$

where R_j(f_i) denotes the similarity rank of internal face f_i with external face F_j, and R(f_i) denotes the final rank of f_i. Since the external faces provide variance in pose and lighting conditions, the final rank gives us a more robust prediction. Since a shot may have more than one i-frame, we average the rank of the face over every i-frame of the shot to get the score indicating how likely the shot is to contain the target face. More details of our face recognition method can be found in [11].

The inclusion of anchor detection assumes that anchors seldom co-occur with a news subject. We have built an anchor detector [3] based on multimodal classification that combines three information sources: the color histogram from image data, speaker ID from audio data, and face information from face detection. Face information contains the position, size and detection confidence of faces. Fisher's Linear Discriminant (FLD) is applied to select distinguishing features for each source of information. The selected features are synthesized into a new feature vector for each shot, and the classification is performed on these feature vectors.

The final prediction of the appearance of the target person is made by linearly combining the results of text-based search, anchor detection and face recognition:

$$P_{prior}(S) = \alpha \, T(S) + \beta \, Anchor(S) + \gamma \, F(S) \qquad (6)$$

where T(S), Anchor(S) and F(S) are the text-search, anchor-detection and face-recognition predictions for shot S, and α, β and γ are their weights, which are trained on a held-out set from FST (as FSD has been used to train the distributions).

4 Experiment results

Experiments in finding the 20 selected persons in the TRECVID 2003 collection are conducted to determine the best people-finding method among those proposed in Sect. 2.3. Firstly, we compare the performance of the basic transcript-based search method without score propagation (denoted as Baseline), the method with flat-window propagation (Flat_Win), the one with shot-based Gaussian propagation using the local distribution estimated from FSD (Shot_Gauss_Local), and its time-based counterpart (Time_Gauss_Local). For each person, we use each method to find the shots in FST that contain his/her visual appearance and compute the mean average precision (MAP) [7] of the results. Note that the propagation window sizes in each method have been fine-tuned based on the FSD data.

(MAP per person for Baseline, Flat_Win, Shot_Gauss_Local and Time_Gauss_Local)

Fig. 3. Performance comparison of three propagation methods with the baseline method.

As shown in Fig. 3, in all the 20 queries at least one propagation approach outperforms the baseline, and for 15 of them all three propagation approaches outperform the baseline. This suggests that score propagation based on timing information can greatly help the task of people-finding. Moreover, in 17 out of the 20 queries, the time-based Gaussian approach is the best performer; its average MAP (0.40) is much higher than that of the flat-window approach (0.29) and the shot-based Gaussian approach (0.28). Thus, time-based Gaussian is a better propagation strategy than the other two, implying that time is a better distance measure than shot offset for revealing the timing between names and people.
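To make the compared propagation strategies concrete, the following sketch implements flat-window and time-based Gaussian propagation in the spirit of Eqs. (2) and (3). It is an illustration under stated assumptions, not the system evaluated here: each name-bearing shot S' is represented by a single mention time and its transcript score, the window is applied to the temporal gap between the mention and the shot, the flat weight is fixed at 1.0 (any constant yields the same ranking), and all function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def fit_gaussian(offsets):
    """Maximum-likelihood mean and standard deviation of the offsets."""
    return fmean(offsets), pstdev(offsets)


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gaussian_weight(shot, mention_time, mu, sigma):
    """Gaussian mass inside the shot, with positions measured from the mention."""
    start, end = shot[0] - mention_time, shot[1] - mention_time
    return normal_cdf(end, mu, sigma) - normal_cdf(start, mu, sigma)


def propagate(shots, mentions, mu, sigma, window, flat=False):
    """Propagated score of every shot.

    shots:    list of (start, end) times in seconds
    mentions: list of (time, score) pairs, one per name-bearing shot
    """
    scores = []
    for shot in shots:
        total = 0.0
        for time, score in mentions:
            # Temporal gap between the mention and the shot (0 if inside it).
            gap = max(shot[0] - time, time - shot[1], 0.0)
            if gap < window:
                weight = 1.0 if flat else gaussian_weight(shot, time, mu, sigma)
                total += weight * score
        scores.append(total)
    return scores
```

With a positive mean offset, a shot that starts shortly after the mention can receive more weight than the shot containing the mention itself, which is the behavior the timing model is meant to capture.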
Fig. 4 shows the average MAP (over 20 queries) of the time-based Gaussian method using the local, global, and combined distributions respectively, in comparison to that of the baseline and flat-window approaches. As shown, the approach with the combined distribution outperforms the global one by 2%, which beats the local one by another 2%, and all are about twice the performance of the baseline approach.
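The combined distribution of Eq. (4) can be sketched as follows. The mixing weight is read here as α = sigmoid(T_X/β − γ), the form that reproduces the quoted values α = 0.5 at T_X = 10 and α ≈ 0.88 at T_X = 30 with β = 10 and γ = 1; this reading, the function names, and the representation of each Gaussian as a (mean, std) pair are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mixing_weight(t_x, beta=10.0, gamma=1.0):
    """Weight of the local model, alpha = sigmoid(T_X / beta - gamma)."""
    return sigmoid(t_x / beta - gamma)


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def combined_pdf(x, local, global_, t_x, beta=10.0, gamma=1.0):
    """Density of the smoothed model at offset x.

    local, global_: (mean, std) pairs of the person-specific and
    pooled Gaussians; t_x: number of training appearances of the person.
    """
    alpha = mixing_weight(t_x, beta, gamma)
    return alpha * normal_pdf(x, *local) + (1.0 - alpha) * normal_pdf(x, *global_)
```

Because the result is a mixture of two densities rather than a single Gaussian, it stays close to the global curve for sparsely observed people and converges to the local curve as T_X grows.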

(Average MAP over all names, frequent names and infrequent names for Baseline, Flat_Win, Time_Gauss_Local, Time_Gauss_Global, Time_Channel_Global, Time_Gauss_Combined, and Time_Gauss_Combined + Visual Info.)

Fig. 4. Performance comparison of local, global, and combined distributions, with visual information.

The three types of distribution lead to a more interesting discrepancy between the performance of finding frequently occurring people and that of finding infrequent ones. Here frequent people are those who appear visually 20+ times in both FSD and FST (cf. Table 1), while infrequent ones are those appearing fewer than 20 times in both FSD and FST. By this standard, there are 7 frequent and 8 infrequent people among the 20 people, while 5 people cannot be clearly classified due to their unbalanced appearances in FST and FSD. As we can see, for frequent names the choice of distribution does not have any significant influence on the performance, while for infrequent names the difference is substantial. Specifically, for all the 7 frequent people, the MAP of the global distribution never differs from that of the local distribution by over 10%, while for 5 out of the 8 infrequent ones, the global distribution enhances the MAP by over 20%. This echoes our observation in Sect. 2.2 that the distributions of frequent names are similar to each other and thus to the global one, which is dominated by the dense training data of frequent people. Therefore, the performance of finding such people is almost unaffected by the choice of distribution. For infrequent people, since their local distributions are poorly estimated from insufficient training data, the performance can benefit from using the more stable global distribution.

It is interesting to see that the combined distribution is better than the global one, which implies that each name has a unique "true" distribution that lies between the global and the local one. However, this conclusion can be challenged given the small number of queries (8 infrequent names) and the small improvement (about 4%).
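The MAP figures compared above average, over queries, the average precision of each ranked shot list. A minimal sketch using the common non-interpolated definition is given below; whether this matches the exact evaluation procedure used here is an assumption, and the function names and shot identifiers are hypothetical.

```python
def average_precision(ranked_shots, relevant):
    """Non-interpolated average precision of one ranked list.

    ranked_shots: shot ids in descending score order
    relevant:     ids of shots that truly contain the person
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    # Relevant shots that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_shots, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Because each relevant shot is weighted by the precision at its rank, moving a correct shot from the fifth to the first position raises the score far more than moving it from the fiftieth to the forty-sixth.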
Since our data consist of both ABC and CNN news, it is interesting to know whether these two channels have different styles that lead to different distributions. Thus, we train two channel-specific global distributions on FSD and test them on FST. As shown in Fig. 4, this approach (Time_Channel_Global) improves MAP over the uniform global distribution by only 1%, suggesting that ABC and CNN have similar editing styles.

Finally, we combine transcript search using the time-based smoothed distribution with vision information. The combination weights we trained from the held-out set are 1.0 for transcript information, -0.81 for anchor filtering and 0.087 for face recognition. These weights reflect the fact that face recognition is very unreliable, while anchor detection has the ability to remove false positives. As shown in Fig. 4, combining transcript with visual information gives another 3% improvement, which is mainly derived from anchor detection. Among the 20 people, the visual information enhances the MAP of 4 people substantially (over 20%), and we find that they all appear with frontal faces in the video. 10 people show minor improvement (1%-20%) in their MAP with visual information, while the remaining 6 people do not improve at all.

5 Conclusion

In this paper, we address the task of finding a person using clues including transcript, video structure, and vision information. The Gaussian distribution has been shown experimentally to be an effective model for describing the timing pattern between a person's visual appearances and the occurrences of his/her name. Specifically, a "smoothed" Gaussian distribution estimated using both the local and global training data produces the best performance, especially for infrequently appearing people. Finally, combining visual information such as face recognition and anchor detection with transcript information brings additional benefit to the person-finding task.

6 Acknowledgement

This research is partially supported by the Advanced Research and Development Activity (ARDA) under contract # MDA908-00-C-0037 and MDA904-0-C-0451.

References

1. Smeulders, et al.: Content-Based Image Retrieval at the End of the Early Years. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 22, No. 12 (2000) 1349-1379.
2. Zhang, H.J., Kankanhalli, A., Smoliar, S.W.: Automatic Partitioning of Full-Motion Video. ACM Multimedia Systems, 1(1) (1993).
3. Hauptmann, A., et al.: Informedia at TRECVID 2003: Analyzing and Searching Broadcast News Video. Proceedings of TREC 2003 (2003).
4. Satoh, S. and Kanade, T.: NAME-IT: Association of Face and Name in Video. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (1997) 775-781.
5. The NIST TREC Video Retrieval Evaluation, http://www-nlpir.nist.gov/projects/trecvid/.
6. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983).
7. Baeza-Yates, R. and Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley, Essex, England (1999).
8. Zhai, C. and Lafferty, J.: A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. Proc. 24th Int'l ACM SIGIR Conf. (2001) 334-342.
9. Pentland, A., Moghaddam, B., and Starner, T.: View-Based and Modular Eigenspaces for Face Recognition. IEEE Conference on Computer Vision & Pattern Recognition (1994).
10. Schneiderman, H. and Kanade, T.: Object Detection Using the Statistics of Parts. International Journal of Computer Vision (2003).
11. Chen, M.Y., Hauptmann, A.: Searching for a Specific Person in Broadcast News Video. Int'l Conf. on Acoustics, Speech, and Signal Processing, May 2004 (to appear).