Document Content-Based Search Using Topic Modeling
Jason Bello, Brian de Silva, Jerry Luo
University of California, Los Angeles
August 9, 2013

Classification of Sidewinder Documents
- Tens of thousands of Sidewinder documents from the Navy

Sidewinder Documents Converted to Text
[Figure panels: Original Sidewinder Document; Converted from Image to Text]

Classification of Sidewinder Documents
- Tens of thousands of Sidewinder documents from the Navy
- Certain documents can be declassified
- Problem is unsupervised
- Content-based search
  - Search with an entire document as the query to find documents with similar content
  - More useful than keyword search for an unsupervised problem
- Limitations of current search

Using keyword search on a document

Table: Search Results for Cincinnati Reds Recap 7/5

Document Description      Date
Cincinnati Reds Recap     7/5
Cincinnati Reds Recap     6/23
Cincinnati Reds Recap     6/23
Toronto Blue Jays Recap   6/31
Toronto Blue Jays Recap   7/3
Minnesota Twins Recap     7/25
Cincinnati Reds Recap     8/13
Minnesota Twins Recap     7/30
Toronto Blue Jays Recap   7/23

Methodology

Converting a Corpus of Documents into a Matrix
- Bag-of-Words
  - Removes most common words, e.g. "the", "and", "because"
  - Produces a histogram vector for each document, where each entry is the count of a specific word
- Term Frequency - Inverse Document Frequency (TF-IDF) (more popular)
  - Diminishes the weight of words that occur frequently throughout the corpus and adds weight to those that occur rarely

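The two weightings can be sketched in a few lines of Python. This is only an illustration: the stop-word list and toy corpus are hypothetical, and TF-IDF is taken here as count times log(N / document frequency), which is one common variant and not necessarily the one used in the talk.

```python
import math
from collections import Counter

# Hypothetical stop-word list and toy corpus, for illustration only.
STOP_WORDS = {"the", "and", "because", "a", "of", "in"}

def bag_of_words(text):
    """Count each non-stop word in a document."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

def tf_idf(histograms):
    """Reweight raw counts by log(N / number of documents containing the word)."""
    n_docs = len(histograms)
    # Iterating a Counter yields each distinct word once per document.
    doc_freq = Counter(word for hist in histograms for word in hist)
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in hist.items()}
        for hist in histograms
    ]

corpus = [
    "the reds hit a home run in the ninth inning",
    "the twins hit a single and the reds lost",
    "allied troops landed in normandy because of the invasion plan",
]
histograms = [bag_of_words(doc) for doc in corpus]
weighted = tf_idf(histograms)
```

Words that appear in only one document ("ninth") end up with a larger weight than words shared across documents ("reds", "hit").
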
Histogram Matrix
Rows index words, columns index documents:
\[ X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix} \]

Topic Modeling
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. [Blei]
\[ \text{Doc}_i = h_{i1}\,\text{Word}_1 + h_{i2}\,\text{Word}_2 + \cdots \]
\[ \text{Doc}_i = v_{i1}\,\text{Topic}_1 + v_{i2}\,\text{Topic}_2 + \cdots \]
\[ \text{Topic}_i = u_{i1}\,\text{Word}_1 + u_{i2}\,\text{Word}_2 + \cdots \]

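As a reading aid, the three relations can be chained: substituting the topic expansion into the document-topic expansion gives back a document-word expansion. The summation indices k (topics) and j (words) are notation added here, and in practice the equalities hold only approximately.

```latex
\begin{align*}
\text{Doc}_i &= \sum_k v_{ik}\,\text{Topic}_k
              = \sum_k v_{ik} \sum_j u_{kj}\,\text{Word}_j
              = \sum_j \Big( \sum_k v_{ik}\,u_{kj} \Big) \text{Word}_j \\
\Rightarrow\quad h_{ij} &= \sum_k v_{ik}\,u_{kj}
\end{align*}
```

So each word weight of a document is a sum over topics of (document's weight on the topic) times (topic's weight on the word), which is a matrix product of the two coefficient tables.
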
Topic Modeling Methods
- Latent Dirichlet Allocation (LDA) [1] (computationally expensive)
- Nonnegative Matrix Factorization (NMF) [2]

[1] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003): 993-1022.
[2] Lee, Daniel D., and H. Sebastian Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems 13 (2001): 556-562.

Nonnegative Matrix Factorization
\[ \min_{U,V} \| X - U V^T \| \]
This reduces to a typical constrained optimization problem. The constraints are $U = [U]_+$, $V = [V]_+$, i.e. every entry of U and V is nonnegative.
When solved we get the following matrices:
\[ \underbrace{X}_{\text{Words} \times \text{Documents}} \approx \underbrace{U}_{\text{Words} \times \text{Topics}} \; \underbrace{V^T}_{\text{Topics} \times \text{Documents}} \]
Nonnegativity gives meaning to the concept of topics. [Seung, 2001]

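One way to solve this problem is with the multiplicative update rules from the Lee and Seung reference cited for NMF. The sketch below assumes the Frobenius norm and a random positive start; the talk does not say which solver or norm was actually used, and the function names and toy matrix are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(x, u, vt):
    """Frobenius norm of x - u @ vt."""
    approx = matmul(u, vt)
    return sum((xij - aij) ** 2
               for xr, ar in zip(x, approx)
               for xij, aij in zip(xr, ar)) ** 0.5

def nmf(x, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate x (words x documents) by u @ vt with u, vt >= 0."""
    rng = random.Random(seed)
    m, n = len(x), len(x[0])
    u = [[rng.random() + eps for _ in range(n_topics)] for _ in range(m)]
    vt = [[rng.random() + eps for _ in range(n)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # V^T <- V^T * (U^T X) / (U^T U V^T), elementwise
        ut = transpose(u)
        num, den = matmul(ut, x), matmul(matmul(ut, u), vt)
        vt = [[v * a / (b + eps) for v, a, b in zip(vr, nr, dr)]
              for vr, nr, dr in zip(vt, num, den)]
        # U <- U * (X V) / (U V^T V), elementwise
        v = transpose(vt)
        num, den = matmul(x, v), matmul(u, matmul(vt, v))
        u = [[w * a / (b + eps) for w, a, b in zip(ur, nr, dr)]
             for ur, nr, dr in zip(u, num, den)]
    return u, vt

# Toy 4-word x 4-document histogram with two obvious word groups.
x = [[2.0, 4.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 6.0],
     [0.0, 0.0, 1.0, 2.0]]
u, vt = nmf(x, n_topics=2)
```

Because the updates only multiply by nonnegative ratios, U and V^T stay nonnegative without any explicit projection step.
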
Similarity Measures
- Euclidean Based Similarity
  Let $u, v \in \mathbb{R}^n$ and let $u^{(i)}$ and $v^{(j)}$ denote different histogram vectors in the corpus.
  \[ 1 - \frac{\| u - v \|}{\max_{j} \| u^{(i)} - v^{(j)} \|} \]
- Cosine Similarity
  Let $u, v \in \mathbb{R}^n$.
  \[ \cos\theta = \frac{u \cdot v}{\| u \| \, \| v \|} \]

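Both measures are easy to sketch in plain Python. The Euclidean version below reads the normalizing term as the largest distance from the query to any vector in the corpus; that reading, and the function names, are assumptions made for this example.

```python
import math

def norm(u):
    return math.sqrt(sum(x * x for x in u))

def distance(u, v):
    return norm([a - b for a, b in zip(u, v)])

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| ||v||)."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def euclidean_similarity(query, v, corpus):
    """1 - ||query - v|| / (largest distance from query to any corpus vector)."""
    max_dist = max(distance(query, w) for w in corpus)
    return 1.0 - distance(query, v) / max_dist

# Toy vectors, made up for illustration.
corpus = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
query = corpus[0]
```

With this normalization the query scores 1 against itself and the farthest document scores 0. Cosine similarity ignores vector length, so `[1, 0]` and `[3, 0]` are maximally similar under cosine but maximally dissimilar under the Euclidean measure in this toy corpus.
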
Test Documents
- Discourse on the Method by Rene Descartes
- Articles about the Invasion of Normandy
- Baseball game recaps for the Cincinnati Reds, the Minnesota Twins, and the Toronto Blue Jays

Current Research Progress - Search Methods

Example Topics

Topic 1   Topic 2     Topic 3    Topic 4    Topic 5   Topic 6
blue      reds        twins      invasion   heart     truth
jays      hit         runs       allied     blood     nature
toronto   second      innings    june       vein      reason
hit       season      game       german     veins     god
second    cincinnati  minnesota  troops     artery    will
time      pirates     inning     normandy   arteries  objects
runs      innings     three      british    motion    thought
good      three       third      landing    cavity    men
three     games       start      france     small     place
single    game        hit        beaches    body      opinions

Documents as Linear Combinations of Topics
- 40 test documents
- Dark squares indicate a strong presence of a given topic in a document

Content-Based Search Method Overview
(1) Compute histogram for each document in corpus.
    (a) Apply TF-IDF (optional).
(2) Apply Nonnegative Matrix Factorization to histogram matrix.
(3) Reweight topic vectors (new research!).
(4) Compute similarity between search document and other documents in the corpus.
    (a) List documents in order of descending similarity.

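Step (4) amounts to scoring every document against the query and sorting. A minimal sketch, assuming cosine similarity on topic vectors; the `search` helper and the numeric topic vectors are hypothetical and only the file names echo the test corpus.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def search(query_name, topic_vectors):
    """Return (name, similarity) pairs in order of descending similarity."""
    query = topic_vectors[query_name]
    scores = [(name, cosine(query, vec)) for name, vec in topic_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical topic vectors (one column of V^T per document); values are made up.
topic_vectors = {
    "dotm14.txt": [0.0, 0.1, 0.9],
    "minn7.txt": [0.2, 0.8, 0.0],
    "cinc9.txt": [0.8, 0.2, 0.0],
    "cinc3.txt": [0.9, 0.1, 0.0],
}
results = search("cinc3.txt", topic_vectors)
ranked_names = [name for name, _ in results]
```

The query document always ranks first, followed by documents with the most similar topic mix.
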
Content-Based Search Results

Histogram Similarity

Document       Similarity
cinc3.txt      1.0000
cinc2.txt      0.7772
cinc4.txt      0.7729
cinc9.txt      0.7569
minn9.txt      0.7470
toronto7.txt   0.7468
cinc7.txt      0.7428
minn7.txt      0.7419
dotm14.txt     0.7406
cinc6.txt      0.7367
WW2 8.txt      0.7361
minn2.txt      0.7358
toronto5.txt   0.7300

Topic Similarity

Document       Similarity
cinc3.txt      1.0000
cinc9.txt      0.9991
cinc1.txt      0.9981
cinc10.txt     0.9980
cinc4.txt      0.9972
cinc2.txt      0.9970
cinc5.txt      0.9959
cinc7.txt      0.9940
cinc8.txt      0.9929
cinc6.txt      0.9905
WW2 8.txt      0.9001
dotm18.txt     0.8996
toronto1.txt   0.8993

Disadvantage of Topic Search
Searching based on the similarity of each document's topic vector does well at finding documents with similar topic compositions, but does not take into account the difference or similarity between distinct topics.

Table: Topic Search

Document       Similarity Index to dotm5.txt
dotm5.txt      1
dotm2.txt      0.9997
...            ...
dotm12.txt     0.9664
WW2 8.txt      0.8924
...            ...
toronto4.txt   0.7027
dotm11.txt     0.7009
...            ...
dotm14.txt     0.0212

Affinity Topic Matrix Weighting
Define A, the affinity topic matrix of U, by
\[ A_{ij} = e^{-\| U_i - U_j \|^2 / \sigma} \]
$A V^T$ is a reweighting of $V^T$ using similarity between topics.

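A small sketch of this reweighting, assuming U_i denotes the i-th column of U (the word vector of topic i) and a Gaussian kernel with a negative exponent. The factor matrices are made-up toy values and the helper names are hypothetical.

```python
import math

def affinity_matrix(u, sigma=1.0):
    """A[i][j] = exp(-||U_i - U_j||^2 / sigma) over the columns U_i of u."""
    topics = list(zip(*u))  # column i of u is the word vector of topic i
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(ti, tj)) / sigma)
             for tj in topics] for ti in topics]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Toy factors: 4 words x 3 topics. Topics 0 and 1 use nearly the same words.
u = [[1.0, 0.9, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.1, 1.0],
     [0.0, 0.0, 0.9]]
# 3 topics x 2 documents: document 0 is pure topic 0, document 1 is pure topic 2.
vt = [[1.0, 0.0],
      [0.0, 0.0],
      [0.0, 1.0]]
a = affinity_matrix(u)
reweighted = matmul(a, vt)  # A V^T
```

After reweighting, document 0 also carries substantial weight on topic 1, because topic 1 is close to topic 0 in word space, while its weight on the unrelated topic 2 stays small.
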
Gram Matrix Weighting
Define G, the Gram topic matrix of U, by
\[ G_{ij} = \langle U_i, U_j \rangle \]
For the dot product, $G = U^T U$.
$G V^T$ is a reweighting of $V^T$ using orthogonality between topics.

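The effect of the Gram reweighting can be seen on a toy example in which two documents load on different but closely related topics. The matrices below are made up for illustration and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def column(m, j):
    return [row[j] for row in m]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

# Toy factors: 4 words x 3 topics. Topics 0 and 1 share most of their words.
u = [[1.0, 0.9, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.1, 1.0],
     [0.0, 0.0, 0.9]]
# 3 topics x 3 documents: document j is made of topic j only.
vt = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]

gram = matmul([list(c) for c in zip(*u)], u)  # G = U^T U
reweighted = matmul(gram, vt)                 # G V^T

before = cosine(column(vt, 0), column(vt, 1))
after = cosine(column(reweighted, 0), column(reweighted, 1))
unrelated = cosine(column(reweighted, 0), column(reweighted, 2))
```

On raw topic vectors documents 0 and 1 look unrelated (cosine 0), but after multiplying by G they become nearly parallel, while document 2, whose topic shares almost no words with the others, stays far away.
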
[Figure: Affinity Matrix Modified Topic Vectors]

[Figure: Gram Matrix Modified Topic Vectors]

Search Results Using 8 Topics

Topic Vector   Affinity Reweighting   Gram Reweighting
cinc3.txt      cinc3.txt              cinc3.txt
cinc9.txt      cinc9.txt              cinc9.txt
cinc1.txt      cinc5.txt              cinc5.txt
cinc10.txt     cinc10.txt             cinc10.txt
cinc4.txt      cinc1.txt              cinc1.txt
cinc2.txt      cinc7.txt              cinc7.txt
cinc5.txt      cinc8.txt              cinc8.txt
cinc7.txt      cinc2.txt              cinc2.txt
cinc8.txt      cinc6.txt              cinc6.txt
cinc6.txt      cinc4.txt              cinc4.txt
WW2 8.txt      minn6.txt              minn6.txt
WW2 6.txt      minn1.txt              minn1.txt
toronto1.txt   minn7.txt              minn7.txt

[Figure: Topic Matrix vs. Number of Topics]

[Figure: Gram Matrix Re-Weighting: Purity vs. Number of Topics]

[Figure: Affinity Matrix Re-Weighting: Purity vs. Number of Topics]

Search Results Using 45 Topics

Topic Vector   Affinity Reweighting   Gram Reweighting
cinc3.txt      cinc3.txt              cinc3.txt
WW2 6.txt      cinc4.txt              cinc6.txt
cinc6.txt      cinc6.txt              cinc8.txt
dotm1.txt      cinc2.txt              cinc2.txt
WW2 8.txt      cinc8.txt              cinc10.txt
dotm15.txt     cinc5.txt              cinc4.txt
dotm18.txt     cinc1.txt              cinc5.txt
WW2 2.txt      cinc10.txt             toronto1.txt
dotm10.txt     minn2.txt              cinc9.txt
dotm8.txt      cinc9.txt              cinc1.txt
WW2 9.txt      toronto1.txt           minn7.txt
toronto5.txt   cinc7.txt              minn6.txt
WW2 1.txt      minn3.txt              cinc7.txt

Sidewinder Document Results

[Figure: Sidewinder Modified Topic Vectors]

Search Results on Sidewinder Documents
Search Results: Gram Matrix Modification (using doc13.txt)

Document    Similarity  Description
doc13.txt   1.0000      Design and development of a Fuze Triggering Device test machine
doc22.txt   0.5251      Sidewinder Fuze Triggering Device evaluation
doc08.txt   0.4283      Developmental Program Plan For the Fuze Triggering Device for AIM-9L Missile System
doc23.txt   0.3073      Wing assembly, Studies of the sidewinder 1c aeromechanics, structures and loads
doc21.txt   0.3067      Results of evaluation of contact-delay self-destruct modules for use in the sidewinder missile
doc05.txt   0.3045      Military Specification Test Set, AIM-9H/L Missile Guidance Control Section
doc03.txt   0.3012      Document Control Plan for AIM-9H/AIM-9L Missile Production
doc14.txt   0.2968      Test report of diagnostic and safety testing of the WDU/9B Sidewinder Exercise Warhead
doc04.txt   0.2965      Development and Evaluation of MK 16 Mod 0 Guided Missile Cradle for Sidewinder 1C Missiles
doc19.txt   0.2935      (Letter) Version numbers for rockets and guided missiles; assignment of

Search Results on Sidewinder Documents
Search Results: Adjacency Matrix Modification (using doc13.txt)

Document    Similarity  Description
doc13.txt   1.0000      Design and development of a Fuze Triggering Device test machine
doc22.txt   0.3652      Sidewinder Fuze Triggering Device evaluation
doc08.txt   0.2430      Developmental Program Plan For the Fuze Triggering Device for AIM-9L Missile System
doc23.txt   0.0887      Wing assembly, Studies of the sidewinder 1c aeromechanics, structures and loads
doc21.txt   0.0784      Results of evaluation of contact-delay self-destruct modules for use in the sidewinder missile
doc05.txt   0.0775      Military Specification Test Set, AIM-9H/L Missile Guidance Control Section
doc03.txt   0.0720      Document Control Plan for AIM-9H/AIM-9L Missile Production
doc14.txt   0.0656      Test report of diagnostic and safety testing of the WDU/9B Sidewinder Exercise Warhead
doc04.txt   0.0636      Development and Evaluation of MK 16 Mod 0 Guided Missile Cradle for Sidewinder 1C Missiles
doc02.txt   0.0615      F-43 - AIM-9 System Modification

Ongoing and Future Works

Search using Spectral Embedding of Topic Vectors
(1) Create affinity matrix for documents using their topic vectors.
(2) Compute the spectrum of the affinity matrix.
(3) Use the spectrum to compute the distance from the chosen document.

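These three steps leave several choices open, so the following is only one possible reading: a Gaussian affinity between topic vectors, the leading k eigenvectors as embedding coordinates, and Euclidean distance in that embedding. The eigenvectors are found with plain power iteration and deflation to keep the sketch dependency-free; all names, the kernel width, and the toy data are assumptions.

```python
import math

def affinity(vectors, sigma=0.5):
    """Gaussian affinity between every pair of document topic vectors."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / sigma)
             for v in vectors] for u in vectors]

def top_eigenvectors(matrix, k, n_iter=1000):
    """Leading k eigenvectors of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    found = []
    for _ in range(k):
        vec = [float(i + 1) for i in range(n)]  # fixed, generic starting vector
        for _ in range(n_iter):
            vec = [sum(a * b for a, b in zip(row, vec)) for row in work]
            length = math.sqrt(sum(x * x for x in vec))
            vec = [x / length for x in vec]
        # Rayleigh quotient gives the eigenvalue of the converged unit vector.
        eigenvalue = sum(v * sum(a * b for a, b in zip(row, vec))
                         for v, row in zip(vec, work))
        # Deflate so the next pass converges to the next eigenvector.
        work = [[work[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
        found.append(vec)
    return found

def spectral_search(query, topic_vectors, k=2):
    """Document indices sorted by distance to the query in the spectral embedding."""
    eig = top_eigenvectors(affinity(topic_vectors), k)
    embedded = list(zip(*eig))  # row i holds the coordinates of document i
    def dist(i):
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(embedded[query], embedded[i])))
    return sorted(range(len(topic_vectors)), key=dist)

# Toy topic vectors: documents 0 and 1 share a topic, as do documents 2 and 3.
topic_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
ranking = spectral_search(0, topic_vectors)
```

In practice a library eigensolver would replace the power iteration; it is written out here only so the example runs with the standard library alone.
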
Search Results using Spectral Embedding of Topic Vector

Cosine Similarity

Document       Similarity
cinc3.txt      1.0000
cinc4.txt      0.9994
cinc7.txt      0.9968
cinc2.txt      0.9901
cinc6.txt      0.9793
cinc10.txt     0.9768
cinc1.txt      0.9755
cinc8.txt      0.9751
cinc9.txt      0.9731
cinc5.txt      0.9702
minn4.txt      0.9277
minn6.txt      0.9222
minn7.txt      0.9183

Euclidean Based Similarity

Document       Similarity
cinc3.txt      1.0000
cinc1.txt      0.9514
cinc9.txt      0.9462
cinc5.txt      0.9340
cinc2.txt      0.9329
cinc4.txt      0.9300
cinc7.txt      0.9282
cinc10.txt     0.9141
cinc8.txt      0.9136
cinc6.txt      0.9134
dotm18.txt     0.6994
dotm1.txt      0.6881
dotm9.txt      0.6853

Comparisons
Average Scores (with initializing k-means)

Method                                     Purity   Inverse Purity
k-means on Random Subspace                 0.4566   0.5940
k-means on Histogram                       1.0000   1.0000
Spectral Clustering on Histogram           0.8197   0.9672
k-means on Topics (V^T)                    0.9920   0.9969
Spectral Clustering on Topics (V^T)        0.9962   0.9974
k-means on Gram re-weighting (GV^T)        0.9950   0.9953
k-means on Affinity re-weighting (AV^T)    0.9892   0.9924

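The slides do not define the two scores, so the sketch below uses the standard clustering definitions as an assumption: purity credits each cluster with its most common true class, and inverse purity swaps the roles of clusters and classes. The labels are made up.

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = 0
    for c in set(clusters):
        members = [lab for cl, lab in zip(clusters, labels) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)

def inverse_purity(clusters, labels):
    """Purity with the roles of clusters and true classes swapped."""
    return purity(labels, clusters)

# Hypothetical example: six documents, three true classes, two clusters.
labels = ["cinc", "cinc", "cinc", "minn", "minn", "dotm"]
clusters = [0, 0, 0, 0, 0, 1]
p = purity(clusters, labels)
ip = inverse_purity(clusters, labels)
```

Here one oversized cluster keeps inverse purity perfect while purity drops to 4/6, which is why the two scores are reported together.
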
Applying Semi-Supervised Search Methods to Sidewinder Documents
- Applying k-means to search results
  - Only show documents in the same cluster as the search document.
- Initializing U
  - Replace initial column vectors of U with histogram vectors of documents.

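Both ideas are simple to sketch. The helpers below are hypothetical: `seed_topics` builds a starting U whose first columns are the histograms of chosen documents (plus a small constant, an assumption added so that multiplicative NMF updates are not locked at zero), and `same_cluster` filters a result list by cluster label.

```python
import random

def seed_topics(x, seed_docs, n_topics, eps=1e-3, rng_seed=0):
    """Initial U (words x topics): first columns come from seed document histograms.

    x is the words x documents histogram matrix; remaining columns are random.
    """
    rng = random.Random(rng_seed)
    n_words = len(x)
    u = [[rng.random() for _ in range(n_topics)] for _ in range(n_words)]
    for topic, doc in enumerate(seed_docs):
        for word in range(n_words):
            u[word][topic] = x[word][doc] + eps
    return u

def same_cluster(results, clusters, query):
    """Keep only results that share the query document's cluster label."""
    return [name for name in results if clusters[name] == clusters[query]]

# Toy data, made up for illustration: 3 words x 2 documents.
x = [[2.0, 0.0],
     [1.0, 0.0],
     [0.0, 4.0]]
u0 = seed_topics(x, seed_docs=[1], n_topics=2)

clusters = {"cinc3.txt": 0, "cinc9.txt": 0, "WW2 8.txt": 1}
shown = same_cluster(["cinc3.txt", "cinc9.txt", "WW2 8.txt"], clusters, "cinc3.txt")
```

Seeding a topic with a known document nudges the factorization toward a topic that a user already cares about, which is the semi-supervised element.
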
Topics with/without Initialization

Topics Before Initializing U

Topic 1   Topic 2     Topic 3    Topic 4   Topic 5
hit       blue jays   heart      truth     invasion
innings   jays        blood      nature    allied
runs      toronto     vein       reason    june
twins     hit         veins      god       german
game      time        artery     will      troops
second    second      arteries   objects   normandy
three     good        motion     thought   british

Topics After Initializing U

Topic 1   Topic 2     Topic 3    Topic 4      Topic 5
blue      twins       heart      reds         invasion
jays      runs        blood      hit          allied
toronto   innings     vein       season       june
hit       game        artery     second       german
second    inning      veins      cincinnati   troops
time      minnesota   arteries   pirates      normandy
good      three       nature     innings      british

Future Directions
- Issues with Purity
  - Necessary but not sufficient
  - Find a way to quantitatively evaluate the search results directly
- Analysis of our new weighting methods
  - Explore mathematics behind the consistency of methods despite difference in number of topics
- Submit to mathematical journal

Questions?

Acknowledgements
- Dr. Blake Hunter and Dr. Theodore Kolokolnikov for useful advice and guidance
- Arjuna Flenner and China Lake Navy Research Lab
- UCLA Department of Mathematics, BUGS, PIC Lab and Dr. Bertozzi for organizing a great research program

The End