CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University
HW3 is out. Poster session is on the last day of classes: Thu March 11 at 4:15. Reports are due March 14. Final is March 18 at 12:15: open book, open notes, no laptop. 3/2/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
Which is the best linear separator? (figure: positive and negative example points with several candidate separating lines) Data: examples (x_1, y_1), ..., (x_n, y_n). Example i: x_i = (x_i^(1), ..., x_i^(d)), y_i ∈ {-1, +1}. Inner product: w·x = Σ_j w^(j) x^(j).
(figure: separating hyperplane w·x = 0 with + and - points on either side) Confidence: γ_i = (w·x_i) y_i. For all datapoints: γ_i ≥ γ.
Maximize the margin: max_{w,γ} γ s.t. ∀i, y_i (w·x_i) ≥ γ. Good according to intuition, theory & practice.
Canonical hyperplanes: projecting x onto the plane w·x = 0 shows that the distance of x from the plane is (w·x)/‖w‖.
Maximizing the margin: max_{w,γ} γ s.t. ∀i, y_i (w·x_i) ≥ γ. Equivalent: min ½‖w‖² s.t. ∀i, y_i (w·x_i) ≥ 1. This is the SVM with hard constraints.
If the data is not separable, introduce a penalty: min ½‖w‖² + C·(number of mistakes) s.t. ∀i, y_i (w·x_i) ≥ 1. Choose C based on cross-validation. How to penalize mistakes?
Introduce slack variables ξ_i: min_{w, ξ_i ≥ 0} ½‖w‖² + C Σ_{i=1}^n ξ_i s.t. ∀i, y_i (w·x_i) ≥ 1 − ξ_i. This is the hinge loss. For each datapoint: if margin ≥ 1, don't care; if margin < 1, pay a linear penalty.
SVM in the natural form: arg min_w f(w), where f(w) = ½‖w‖² + C Σ_{i=1}^n max{0, 1 − y_i (w·x_i)}.
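This objective is easy to evaluate directly; a minimal sketch in Python (the function and variable names are mine, not from the slides):

```python
def svm_objective(w, data, C=1.0):
    """f(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i))."""
    reg = 0.5 * sum(wj * wj for wj in w)  # regularization term
    hinge = sum(max(0.0, 1.0 - y * sum(wj * xj for wj, xj in zip(w, x)))
                for x, y in data)         # total hinge loss
    return reg + C * hinge
```

For w = [1, 0], the point ([2, 0], +1) has margin 2 and contributes no hinge loss, while ([0.5, 0], +1) has margin 0.5 and contributes 0.5.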
How to estimate w? Use a quadratic solver: minimize the quadratic function ½‖w‖² + C Σ_{i=1}^n ξ_i subject to the linear constraints y_i (w·x_i) ≥ 1 − ξ_i, ξ_i ≥ 0. Or use stochastic gradient descent: minimize f(w) = ½‖w‖² + C Σ_{i=1}^n max{0, 1 − y_i (w·x_i)} with the update w_{t+1} = w_t − η_t f′(w_t; x_i, y_i), where the (sub)gradient f′ is evaluated on a single example (x_i, y_i).
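The SGD variant can be sketched as follows, taking a subgradient step on one example at a time (a toy implementation with made-up defaults, not the lecture's code):

```python
import random

def sgd_svm(data, C=1.0, eta=0.01, epochs=100, d=2):
    """Minimize f(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    by stochastic subgradient descent, one example at a time."""
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            for j in range(d):
                # subgradient: regularizer plus hinge term when margin < 1
                grad_j = w[j] - (C * y * x[j] if margin < 1 else 0.0)
                w[j] -= eta * grad_j
            if margin < 1:
                b += eta * C * y
    return w, b
```

On a small separable toy set this recovers a separating hyperplane after a few hundred epochs.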
Example by Léon Bottou: Reuters RCV1 document corpus. m = 781k training examples, 23k test examples, d = 50k features. (table: training-time comparison)
What if we subsample the dataset? SGD on the full dataset vs. conjugate gradient on n training examples.
Need to choose the learning rate η_t. Léon suggests updates of the form w_{t+1} = w_t − (1/(t + t_0)) L′(w_t): select a small subsample, try various rates η, pick the one that most reduces the loss, and use it for the next 100k iterations on the full dataset.
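The rate-selection heuristic can be sketched generically; here `loss_after_epoch` is a hypothetical callback (my naming, not Bottou's) that runs one SGD epoch on the subsample at a given rate and returns the resulting loss:

```python
def pick_learning_rate(loss_after_epoch, rates, subsample):
    """Try each candidate rate on a small subsample and keep
    the one that most reduces the loss."""
    best_rate, best_loss = None, float("inf")
    for eta in rates:
        loss = loss_after_epoch(subsample, eta)
        if loss < best_loss:
            best_rate, best_loss = eta, loss
    return best_rate
```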
Stopping criteria: how many iterations of SGD? Early stopping with cross-validation: create a validation set, monitor the cost function on the validation set, stop when the loss stops decreasing. Early stopping a priori: extract two disjoint subsamples A and B of the training data, determine the number of epochs k by training on A and stopping by validating on B, then train for k epochs on the full dataset.
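The validation-set variant can be written as a generic loop; `run_epoch` and `validation_loss` are hypothetical callbacks standing in for one SGD pass and for evaluating the cost on the held-out set:

```python
def train_with_early_stopping(run_epoch, validation_loss,
                              max_epochs=100, patience=3):
    """Run epochs until the validation loss has not improved
    for `patience` consecutive epochs; return the epoch count."""
    best, since_best, epoch = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        run_epoch()                  # one pass over the training set
        loss = validation_loss()     # cost on the validation set
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return epoch
```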
Kernel function: K(x_i, x_j) = Φ(x_i)·Φ(x_j). Does the SVM kernel trick still work with SGD? Yes (but not without a price): represent w by its kernel expansion w = Σ_i α_i Φ(x_i). Usually dL/dw is a multiple of Φ(x_j) for the current example x_j, so the update at epoch t combines shrinking the expansion with a new term: w_{t+1} = (1 − η_t λ) w_t + a term proportional to Φ(x_j).
[Shalev-Shwartz et al., ICML '07] We had before: min_w ½‖w‖² + C Σ_{i=1}^n max{0, 1 − y_i (w·x_i)}. Can replace C with λ: min_w (λ/2)‖w‖² + (1/n) Σ_{i=1}^n max{0, 1 − y_i (w·x_i)}.
[Shalev-Shwartz et al., ICML '07] Pegasos: at each iteration choose a subset A_t ⊆ S. With A_t = S it is the subgradient method; with |A_t| = 1 it is stochastic gradient. Each iteration takes a subgradient step followed by a projection.
[Shalev-Shwartz et al., ICML '07] Choosing |A_t| = 1 and a linear kernel over R^d. Theorem [Shalev-Shwartz et al. '07]: the runtime required for Pegasos to find an ε-accurate solution with probability > 1 − δ is Õ(d/(λε)). The runtime depends on the number of features d, does not depend on the number of examples m, and depends on the difficulty of the problem (λ and ε).
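A minimal Pegasos step with |A_t| = 1 might look like this (a sketch of the published algorithm, including the optional projection step; variable names are mine):

```python
import math, random

def pegasos(data, lam=0.1, T=1000, d=2):
    """Pegasos with |A_t| = 1: step size eta_t = 1/(lam*t), shrink w,
    add the hinge subgradient, then project onto the 1/sqrt(lam) ball."""
    w = [0.0] * d
    for t in range(1, T + 1):
        x, y = random.choice(data)
        eta = 1.0 / (lam * t)
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        w = [(1 - eta * lam) * wj + (eta * y * xj if margin < 1 else 0.0)
             for wj, xj in zip(w, x)]
        # optional projection step from the Pegasos paper
        norm = math.sqrt(sum(wj * wj for wj in w))
        radius = 1.0 / math.sqrt(lam)
        if norm > radius:
            w = [wj * radius / norm for wj in w]
    return w
```

Note the shrink factor (1 − η_t λ) = 1 − 1/t, which gives the characteristic 1/t averaging behavior.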
SVMs and structured output prediction. Setting: assume the data is i.i.d. from a fixed distribution. Given: a training sample. Goal: find a function from input space X to output space Y, where the outputs are complex objects.
Examples: natural language parsing. Given a sequence of words x, predict the parse tree y. Dependencies come from structural constraints, since y has to be a tree. E.g., x = "The dog chased the cat", y = its parse tree (S → NP VP; NP → Det N; VP → V NP).
Approach: view it as a multiclass classification task where every complex output y (e.g., each candidate parse tree y_1, y_2, ..., y_k) is one class. Problems: exponentially many classes! How to predict efficiently? How to learn efficiently? Potentially huge model! Manageable number of features?
A feature vector Ψ(x, y) describes the match between x and y. Learn a single weight vector w and rank outputs by w·Ψ(x, y). Hard-margin optimization problem: min_w ½‖w‖² s.t. ∀i, ∀y ≠ y_i: w·Ψ(x_i, y_i) ≥ w·Ψ(x_i, y) + 1.
[Yue et al., SIGIR '07] Ranking: given a query x, predict a ranking y. Dependencies between results (e.g., avoid redundant hits). Loss function over rankings (e.g., AvgPrec). Example: x = "SVM", y = 1. Kernel-Machines 2. SVM-Light 3. Learning with Kernels 4. SV Meppen Fan Club 5. Service Master & Co. 6. School of Volunteer Management 7. SV Mattersburg Online.
[Yue et al., SIGIR '07] Given: a complete (weak) ranking of documents for a query. Predict: a ranking for the input query and document set. The true labeling is a ranking where the relevant documents are all ranked in the front; an incorrect labeling is any other ranking. There are intractably many rankings, and thus an intractable number of constraints!
[Yue et al., SIGIR '07] Let x be a set of documents for a query. Let y denote a weak ranking (pairwise orderings), y_ij ∈ {−1, +1}. SVM objective function: min ½‖w‖² + Cξ. Constraints are defined for each incorrect ranking y′ over the set of documents x: ∀y′ ≠ y: w^T Ψ(y, x) ≥ w^T Ψ(y′, x) + Δ(y, y′) − ξ, where Δ(y, y′) measures the mismatch between target and prediction.
[Yue et al., SIGIR '07] Loss: average precision is the average of the precision scores at the rank locations of each relevant document. Ex: a ranking with relevant documents at ranks 1, 3, and 5 has average precision (1/1 + 2/3 + 3/5)/3 ≈ 0.76.
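AvgPrec as defined here is straightforward to compute; a small sketch (my own helper, reproducing the slide's example):

```python
def average_precision(relevance):
    """relevance[i] is True if the document at rank i+1 is relevant.
    AP = mean over relevant docs of the precision at that doc's rank."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank     # precision at this rank
    return total / hits if hits else 0.0
```

With relevant documents at ranks 1, 3, and 5, this returns (1/1 + 2/3 + 3/5)/3 ≈ 0.756, the 0.76 of the example.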
[Yue et al., SIGIR '07] Optimize: min ½‖w‖² + Cξ subject to ∀y′ ≠ y: w^T Ψ(y, x) ≥ w^T Ψ(y′, x) + Δ(y, y′) − ξ, where Ψ(y, x) = Σ_{i: rel} Σ_{j: ¬rel} y_ij (x_i − x_j) and Δ(y, y′) = 1 − AvgPrec(y′). After learning, predict by sorting on w·x_i.
[Yue et al., SIGIR '07] Original SVM problem: exponentially many constraints, most of which are dominated by a small set of important constraints. Structural SVM approach: repeatedly find the next most violated constraint until the working set of constraints is a good approximation.
Cutting-plane training [Joachims '06; Joachims, Finley & Yu '08]: Input: training sample, C, ε. REPEAT: FOR each example, compute the most violated constraint; IF it is violated by more than ε, add it to the working set and re-optimize the StructSVM over the working set; ENDFOR; UNTIL the working set has not changed during the iteration.
Cutting-plane algorithm: STEP 1: solve the SVM objective function using only the current working set of constraints. STEP 2: using the model learned in STEP 1, find the most violated constraint from the exponential set of constraints. STEP 3: if the constraint returned in STEP 2 is more violated than the most violated constraint in the working set by some small constant, add it to the working set. Repeat STEPs 1-3 until no additional constraints are added; return the most recent model trained in STEP 1. STEPs 1-3 are guaranteed to loop for at most a polynomial number of iterations [Tsochantaridis et al. 2005].
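STEPs 1-3 can be written as a generic loop; `solve_qp`, `most_violated`, and `violation` are hypothetical callbacks for the three problem-specific pieces (the QP over the working set, the argmax oracle, and the amount by which a constraint is violated):

```python
def cutting_plane(solve_qp, most_violated, violation, eps=1e-3, max_iter=100):
    """Generic cutting-plane loop: solve over the working set, find the
    most violated constraint, and add it if it is violated by more than
    eps beyond the worst constraint already in the set."""
    working_set = []
    model = solve_qp(working_set)            # STEP 1
    for _ in range(max_iter):
        c = most_violated(model)             # STEP 2 (the oracle)
        worst_in_set = max((violation(model, ci) for ci in working_set),
                           default=0.0)
        if violation(model, c) <= worst_in_set + eps:
            return model                     # good approximation reached
        working_set.append(c)                # STEP 3
        model = solve_qp(working_set)
    return model
```

The toy instance below treats constraints as the numbers 1..3, each "violated" until it enters the working set.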
A structural SVM is an oracle framework: it requires a subroutine for finding the most violated constraint, which depends on the formulation of the loss function and the joint feature representation. There is an exponential number of constraints! An efficient algorithm exists in the case of optimizing Mean Average Precision (MAP): MAP is invariant to the order of documents within a relevance class.
[Yue et al., SIGIR '07] H(y′; w) = Δ(y, y′) + Σ_{i: rel} Σ_{j: ¬rel} y′_ij w^T (x_i − x_j). Observation: MAP is invariant to the order of documents within a relevance class; swapping two relevant or two non-relevant documents does not change MAP. The joint SVM score is optimized by sorting by document score w·x_j. Finding the most violated constraint reduces to finding an interleaving between two sorted lists of documents.
H(y′; w) = Δ(y, y′) + Σ_{i: rel} Σ_{j: ¬rel} y′_ij w^T (x_i − x_j). Greedy procedure: start with the perfect ranking; consider swapping adjacent relevant/non-relevant documents; find the best feasible rank of the non-relevant document; repeat for the next non-relevant document (never want to swap past the previous non-relevant document); repeat until all non-relevant documents have been considered.
SVM formulation: SVMs optimize a trade-off between model complexity and MAP loss; there is an exponential number of constraints (one for each incorrect ranking); structural SVMs find a small subset of important constraints; this requires a subprocedure to find the most violated constraint. Finding the most violated constraint: the loss function is invariant to reordering within a relevance class; the SVM score imposes an ordering of the relevant documents; the search reduces to finding an interleaving of two sorted lists; the loss function has certain monotonicity properties, which yields an efficient algorithm.