Document Content-Based Search Using Topic Modeling

Document Content-Based Search Using Topic Modeling Jason Bello, Brian de Silva, Jerry Luo University of California, Los Angeles August 9, 2013 Jason Bello, Brian de Silva, Jerry Luo (UCLA) Topic Modeling August 9, 2013 1 / 42

Classification of Sidewinder Documents
Tens of thousands of Sidewinder Documents from the Navy.

Sidewinder Documents Converted to Text
Figure panels: Original Sidewinder Document; Converted from Image to Text

Classification of Sidewinder Documents
Tens of thousands of Sidewinder Documents from the Navy
Certain documents can be declassified
Problem is unsupervised
Content-based search
  Search with an entire document to find documents with similar content
  More useful than keyword search for an unsupervised problem
Limitations of current search

Using keyword search on a document
Table: Search Results for Cincinnati Reds Recap 7/5
Document Description        Date
Cincinnati Reds Recap       7/5
Cincinnati Reds Recap       6/23
Cincinnati Reds Recap       6/23
Toronto Blue Jays Recap     6/31
Toronto Blue Jays Recap     7/3
Minnesota Twins Recap       7/25
Cincinnati Reds Recap       8/13
Minnesota Twins Recap       7/30
Toronto Blue Jays Recap     7/23

Methodology

Converting a Corpus of Documents into a Matrix
Bag-of-Words
  Removes most common words, e.g. the, and, because
  Produces histogram vector for each document where each entry is the count of a specific word
Term Frequency - Inverse Document Frequency (TF-IDF) (more popular)
  Diminishes the weight of words that occur frequently throughout the corpus and adds weight to those that occur rarely
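The two weightings above can be sketched as follows. This is only an illustration: the stop-word list, the three-sentence corpus, and the function names are hypothetical, and the slides do not say which TF-IDF variant was used, so the sketch assumes the common count * log(N / df) form.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "and", "because", "a", "in"}

def bag_of_words(text):
    """Count each non-stop word in one document."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

def histogram_matrix(docs):
    """Return (vocab, X) where X[i][j] is the count of word i in document j."""
    counts = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*counts))
    X = [[c[w] for c in counts] for w in vocab]
    return vocab, X

def tf_idf(X):
    """Scale each word's counts by log(N / df); the variant is an assumption."""
    n_docs = len(X[0])
    weighted = []
    for row in X:
        df = sum(1 for x in row if x > 0)  # documents containing this word
        idf = math.log(n_docs / df)
        weighted.append([x * idf for x in row])
    return weighted

# Hypothetical corpus.
docs = [
    "the reds hit a home run",
    "the twins hit in the third inning",
    "allied troops landed in normandy",
]
vocab, X = histogram_matrix(docs)
W = tf_idf(X)
```

In this toy corpus "hit" appears in two of the three documents and "normandy" in one, so the TF-IDF weight of "normandy" is larger even though both raw counts are 1.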

Histogram Matrix
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}$$
Rows are indexed by words and columns by documents.

Topic Modeling
"Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections." [Blei]
$\text{Doc}_i = h_{i1}\,\text{Word}_1 + h_{i2}\,\text{Word}_2 + \cdots$
$\text{Doc}_i = v_{i1}\,\text{Topic}_1 + v_{i2}\,\text{Topic}_2 + \cdots$
$\text{Topic}_i = u_{i1}\,\text{Word}_1 + u_{i2}\,\text{Word}_2 + \cdots$

Topic Modeling Methods
Latent Dirichlet Allocation (LDA) [1] (computationally expensive)
Nonnegative Matrix Factorization (NMF) [2]
[1] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003): 993-1022.
[2] Seung, D., and L. Lee. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems 13 (2001): 556-562.

Nonnegative Matrix Factorization
$$\min_{U,V} \|X - UV^T\|$$
This reduces to a typical constrained optimization problem. The constraints are $U = [U]_+$, $V = [V]_+$. When solved we get the following matrices:
$X$ (Words x Documents) $\approx$ $U$ (Words x Topics) times $V^T$ (Topics x Documents)
Nonnegativity gives meaning to the concept of topics. [Seung, 2001]
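One standard way to solve this problem is with multiplicative updates, as in the Seung and Lee paper cited on the Topic Modeling Methods slide. The slides do not state which solver or norm the authors used, so the Frobenius norm, the random initialization, the iteration count, and the function name `nmf` below are all assumptions of this sketch.

```python
import numpy as np

def nmf(X, n_topics, n_iter=500, eps=1e-9, seed=0):
    """Approximate X (words x documents) by U @ V.T with U, V >= 0.

    U is words x topics and V is documents x topics. Uses multiplicative
    updates for the Frobenius norm ||X - U V^T||.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, n_topics)) + eps
    V = rng.random((n, n_topics)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so entries stay nonnegative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

# Toy 4-word x 4-document matrix with two obvious word groups (hypothetical data).
X = np.array([[3.0, 2.0, 0.0, 0.0],
              [2.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])
U, V = nmf(X, n_topics=2)
error = np.linalg.norm(X - U @ V.T)
```

Column k of `U` is then read as topic k (weights over words), and row j of `V` as the topic vector of document j.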

Similarity Measures
Euclidean Based Similarity
  Let $u, v \in \mathbb{R}^n$ and let $u^{(i)}$ and $v^{(j)}$ denote different histogram vectors in the corpus
  $$1 - \frac{\|u - v\|}{\max_{j} \|u^{(i)} - v^{(j)}\|}$$
Cosine Similarity
  Let $u, v \in \mathbb{R}^n$
  $$\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}$$
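Both measures are short to write down in code. In this sketch the normalizing constant of the Euclidean measure is taken to be the largest distance between any two vectors in the corpus; the slide's notation leaves the exact range of the maximum unclear, so that reading is an assumption, as are the function names and the toy vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| ||v||)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def max_pairwise_distance(vectors):
    """Largest Euclidean distance between any two rows of `vectors`."""
    diffs = vectors[:, None, :] - vectors[None, :, :]
    return float(np.linalg.norm(diffs, axis=2).max())

def euclidean_similarity(u, v, max_dist):
    """1 - ||u - v|| / max_dist, so identical vectors score 1."""
    return float(1.0 - np.linalg.norm(u - v) / max_dist)

# Hypothetical corpus of three 2-dimensional vectors, one per row.
docs = np.array([[3.0, 0.0], [6.0, 0.0], [0.0, 4.0]])
d_max = max_pairwise_distance(docs)

same_direction = cosine_similarity(docs[0], docs[1])
euclid_01 = euclidean_similarity(docs[0], docs[1], d_max)
```

The first two vectors point in the same direction, so their cosine similarity is 1 regardless of length, while the Euclidean measure still penalizes the difference in magnitude.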

Test Documents
Discourse on the Method by Rene Descartes
Articles about the Invasion of Normandy
Baseball game recaps for the Cincinnati Reds, the Minnesota Twins, and the Toronto Blue Jays

Current Research Progress - Search Methods

Example Topics
Topic 1: blue, jays, toronto, hit, second, time, runs, good, three, single
Topic 2: reds, hit, second, season, cincinnati, pirates, innings, three, games, game
Topic 3: twins, runs, innings, game, minnesota, inning, three, third, start, hit
Topic 4: invasion, allied, june, german, troops, normandy, british, landing, france, beaches
Topic 5: heart, blood, vein, veins, artery, arteries, motion, cavity, small, body
Topic 6: truth, nature, reason, god, will, objects, thought, men, place, opinions

Documents as Linear Combinations of Topics
40 test documents
Dark squares indicate a strong presence of a given topic in a document

Content-Based Search Method Overview
(1) Compute histogram for each document in corpus.
  (a) Apply TF-IDF (optional).
(2) Apply Nonnegative Matrix Factorization to histogram matrix.
(3) Reweight topic vectors (new research!).
(4) Compute similarity between search document and other documents in the corpus.
  (a) List documents in order of descending similarity.
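Step (4) can be sketched as a ranking function over the columns of the topic matrix. The sketch assumes cosine similarity and a topics x documents matrix `Vt`; the topic weights below are hypothetical, and `rank_documents` is an illustrative name, not something defined in the slides.

```python
import numpy as np

def rank_documents(Vt, names, query):
    """Rank documents by cosine similarity of their topic vectors to the query.

    Vt is topics x documents, so column j is the topic vector of names[j].
    Assumes no document has an all-zero topic vector.
    """
    q = names.index(query)
    cols = Vt / np.linalg.norm(Vt, axis=0, keepdims=True)  # unit-length columns
    sims = cols.T @ cols[:, q]
    order = np.argsort(-sims, kind="stable")  # descending similarity
    return [(names[j], float(sims[j])) for j in order]

# Hypothetical topic weights for four documents and two topics.
names = ["cinc1.txt", "cinc2.txt", "minn1.txt", "dotm1.txt"]
Vt = np.array([[0.9, 0.8, 0.6, 0.0],
               [0.1, 0.3, 0.5, 1.0]])
results = rank_documents(Vt, names, "cinc1.txt")
```

The query document always comes first with similarity 1, which matches the first row of each results table in the slides.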

Content-Based Search Results
Histogram Similarity            Topic Similarity
Document        Similarity      Document        Similarity
cinc3.txt       1.0000          cinc3.txt       1.0000
cinc2.txt       0.7772          cinc9.txt       0.9991
cinc4.txt       0.7729          cinc1.txt       0.9981
cinc9.txt       0.7569          cinc10.txt      0.9980
minn9.txt       0.7470          cinc4.txt       0.9972
toronto7.txt    0.7468          cinc2.txt       0.9970
cinc7.txt       0.7428          cinc5.txt       0.9959
minn7.txt       0.7419          cinc7.txt       0.9940
dotm14.txt      0.7406          cinc8.txt       0.9929
cinc6.txt       0.7367          cinc6.txt       0.9905
WW2 8.txt       0.7361          WW2 8.txt       0.9001
minn2.txt       0.7358          dotm18.txt      0.8996
toronto5.txt    0.7300          toronto1.txt    0.8993

Disadvantage of Topic Search
Searching based on the similarity of each document's topic vectors does well at finding documents with similar topic compositions, but it does not take into account the difference or similarity between distinct topics.
Table: Topic Search
Document        Similarity Index to dotm5.txt
dotm5.txt       1
dotm2.txt       0.9997
...             ...
dotm12.txt      0.9664
WW2 8.txt       0.8924
...             ...
toronto4.txt    0.7027
dotm11.txt      0.7009
...             ...
dotm14.txt      0.0212
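A small worked case, using hypothetical topic vectors rather than values from the test corpus, shows the problem. Suppose topics 1 and 2 are both baseball topics that share words such as "hit" and "innings", while topic 3 is unrelated. A document made up only of topic 1 and a document made up only of topic 2 have orthogonal topic vectors:

```latex
\[
v^{(1)} = (1, 0, 0), \qquad v^{(2)} = (0, 1, 0), \qquad
\cos\theta = \frac{v^{(1)} \cdot v^{(2)}}{\|v^{(1)}\|\,\|v^{(2)}\|} = \frac{0}{1 \cdot 1} = 0
\]
```

The two documents score as completely dissimilar, exactly as either would against a document made up only of topic 3, even though their topics overlap in vocabulary.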

Affinity Topic Matrix Weighting
Define $A$, the affinity topic matrix of $U$, by
$$A_{ij} = e^{-\|U_i - U_j\|^2 / \sigma}$$
$AV^T$ is a reweighting of $V^T$ using similarity between topics
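A minimal sketch of this reweighting, assuming $U_i$ denotes column $i$ of $U$ (the word weights of topic $i$) and that the exponent carries a minus sign, as is usual for a Gaussian affinity. The matrices, the value of `sigma`, and the function name are hypothetical.

```python
import numpy as np

def affinity_matrix(U, sigma):
    """A[i, j] = exp(-||U_i - U_j||^2 / sigma) for topic columns U_i, U_j."""
    diffs = U[:, :, None] - U[:, None, :]   # words x topics x topics
    sq_dists = (diffs ** 2).sum(axis=0)     # topics x topics
    return np.exp(-sq_dists / sigma)

# Hypothetical U (4 words x 3 topics): topics 0 and 1 share vocabulary, topic 2 does not.
U = np.array([[1.0, 0.9, 0.0],
              [1.0, 0.8, 0.0],
              [0.0, 0.1, 0.0],
              [0.0, 0.0, 1.0]])
# Hypothetical V^T (topics x documents): each document is a single pure topic.
Vt = np.eye(3)

A = affinity_matrix(U, sigma=1.0)
reweighted = A @ Vt
```

Because topics 0 and 1 are close as word vectors, `A[0, 1]` is near 1, so a document that is purely topic 0 picks up weight on topic 1 after reweighting, while its weight on the unrelated topic 2 stays small.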

Gram Matrix Weighting
Define $G$, the Gram topic matrix of $U$, by
$$G_{ij} = \langle U_i, U_j \rangle$$
For dot product, $G = U^T U$
$GV^T$ is a reweighting of $V^T$ using orthogonality between topics
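The effect of the Gram reweighting can be checked on a toy case. The matrices below are hypothetical: two topics that share vocabulary, one unrelated topic, and three documents that are each a single pure topic.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical U (4 words x 3 topics): topics 0 and 1 overlap, topic 2 is unrelated.
U = np.array([[1.0, 0.9, 0.0],
              [1.0, 0.8, 0.0],
              [0.0, 0.1, 0.0],
              [0.0, 0.0, 1.0]])
# Hypothetical V^T (topics x documents): document j is purely topic j.
Vt = np.eye(3)

G = U.T @ U              # Gram matrix of the topic columns
reweighted = G @ Vt      # reweighted topic vectors, one column per document

before = cosine(Vt[:, 0], Vt[:, 1])
after = cosine(reweighted[:, 0], reweighted[:, 1])
unrelated = cosine(reweighted[:, 0], reweighted[:, 2])
```

Before reweighting, documents 0 and 1 have cosine similarity 0 because their topic vectors are orthogonal. After reweighting they are nearly parallel, while document 2, whose topic shares no words with the others, still has similarity 0 to document 0.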

Affinity Matrix Modified Topic Vectors

Gram Matrix Modified Topic Vectors

Search Results Using 8 Topics
Topic Vector    Affinity Reweighting    Gram Reweighting
cinc3.txt       cinc3.txt               cinc3.txt
cinc9.txt       cinc9.txt               cinc9.txt
cinc1.txt       cinc5.txt               cinc5.txt
cinc10.txt      cinc10.txt              cinc10.txt
cinc4.txt       cinc1.txt               cinc1.txt
cinc2.txt       cinc7.txt               cinc7.txt
cinc5.txt       cinc8.txt               cinc8.txt
cinc7.txt       cinc2.txt               cinc2.txt
cinc8.txt       cinc6.txt               cinc6.txt
cinc6.txt       cinc4.txt               cinc4.txt
WW2 8.txt       minn6.txt               minn6.txt
WW2 6.txt       minn1.txt               minn1.txt
toronto1.txt    minn7.txt               minn7.txt

Topic Matrix Vs Number of Topics

Gram Matrix Re-Weighting: Purity Vs Number of Topics

Affinity Matrix Re-Weighting: Purity Vs Number of Topics

Search Results Using 45 Topics
Topic Vector    Affinity Reweighting    Gram Reweighting
cinc3.txt       cinc3.txt               cinc3.txt
WW2 6.txt       cinc4.txt               cinc6.txt
cinc6.txt       cinc6.txt               cinc8.txt
dotm1.txt       cinc2.txt               cinc2.txt
WW2 8.txt       cinc8.txt               cinc10.txt
dotm15.txt      cinc5.txt               cinc4.txt
dotm18.txt      cinc1.txt               cinc5.txt
WW2 2.txt       cinc10.txt              toronto1.txt
dotm10.txt      minn2.txt               cinc9.txt
dotm8.txt       cinc9.txt               cinc1.txt
WW2 9.txt       toronto1.txt            minn7.txt
toronto5.txt    cinc7.txt               minn6.txt
WW2 1.txt       minn3.txt               cinc7.txt

Sidewinder Document Results

Sidewinder Modified Topic Vectors

Search Results on Sidewinder Documents
Search Results: Gram Matrix Modification (using doc13.txt)
Document     Similarity   Description
doc13.txt    1.0000       Design and development of a Fuze Triggering Device test machine
doc22.txt    0.5251       Sidewinder Fuze Triggering Device evaluation
doc08.txt    0.4283       Developmental Program Plan For the Fuze Triggering Device for AIM-9L Missile System
doc23.txt    0.3073       Wing assembly, Studies of the sidewinder 1c aeromechanics, structures and loads
doc21.txt    0.3067       Results of evaluation of contact-delay self-destruct modules for use in the sidewinder missile
doc05.txt    0.3045       Military Specification Test Set, AIM-9H/L Missile Guidance Control Section
doc03.txt    0.3012       Document Control Plan for AIM-9H/AIM-9L Missile Production
doc14.txt    0.2968       Test report of diagnostic and safety testing of the WDU/9B Sidewinder Exercise Warhead
doc04.txt    0.2965       Development and Evaluation of MK 16 Mod 0 Guided Missile Cradle for Sidewinder 1C Missiles
doc19.txt    0.2935       (Letter) Version numbers for rockets and guided missiles; assignment of

Search Results on Sidewinder Documents
Search Results: Adjacency Matrix Modification (using doc13.txt)
Document     Similarity   Description
doc13.txt    1.0000       Design and development of a Fuze Triggering Device test machine
doc22.txt    0.3652       Sidewinder Fuze Triggering Device evaluation
doc08.txt    0.2430       Developmental Program Plan For the Fuze Triggering Device for AIM-9L Missile System
doc23.txt    0.0887       Wing assembly, Studies of the sidewinder 1c aeromechanics, structures and loads
doc21.txt    0.0784       Results of evaluation of contact-delay self-destruct modules for use in the sidewinder missile
doc05.txt    0.0775       Military Specification Test Set, AIM-9H/L Missile Guidance Control Section
doc03.txt    0.0720       Document Control Plan for AIM-9H/AIM-9L Missile Production
doc14.txt    0.0656       Test report of diagnostic and safety testing of the WDU/9B Sidewinder Exercise Warhead
doc04.txt    0.0636       Development and Evaluation of MK 16 Mod 0 Guided Missile Cradle for Sidewinder 1C Missiles
doc02.txt    0.0615       F-43 - AIM-9 System Modification

Ongoing and Future Works

Search using Spectral Embedding of Topic Vectors
(1) Create affinity matrix for documents using their topic vectors.
(2) Compute the spectrum of the affinity matrix.
(3) Use the spectrum to compute the distance from the chosen document.
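The three steps leave several details open, so the following is only one plausible reading. It assumes a Gaussian affinity between document topic vectors, keeps the eigenvectors with the largest eigenvalues as the embedding, and measures Euclidean distance in that embedding. The kernel, `sigma`, the number of eigenvectors kept, and all names are assumptions of this sketch.

```python
import numpy as np

def spectral_embedding(Vt, sigma=1.0, n_components=2):
    """Embed documents using leading eigenvectors of a Gaussian affinity matrix.

    Vt is topics x documents. Returns a documents x n_components array.
    """
    docs = Vt.T                                     # documents x topics
    diffs = docs[:, None, :] - docs[None, :, :]
    W = np.exp(-(diffs ** 2).sum(axis=2) / sigma)   # (1) document affinity matrix
    eigvals, eigvecs = np.linalg.eigh(W)            # (2) spectrum, ascending order
    return eigvecs[:, -n_components:]               # keep the largest eigenvalues

def distances_from(embedding, query_index):
    """(3) Euclidean distance from the query document to every document."""
    return np.linalg.norm(embedding - embedding[query_index], axis=1)

# Hypothetical topic vectors: documents 0 and 1 are alike, as are documents 2 and 3.
Vt = np.array([[1.0, 0.9, 0.0, 0.1],
               [0.0, 0.1, 1.0, 0.9]])
emb = spectral_embedding(Vt)
dist = distances_from(emb, 0)
```

Sorting `dist` in ascending order then gives the search ranking for document 0. The distances do not depend on the arbitrary sign of each eigenvector, since flipping a column of the embedding leaves differences between rows unchanged in magnitude.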

Search Results using Spectral Embedding of Topic Vector
Cosine Similarity               Euclidean Based Similarity
Document        Similarity      Document        Similarity
cinc3.txt       1.0000          cinc3.txt       1.0000
cinc4.txt       0.9994          cinc1.txt       0.9514
cinc7.txt       0.9968          cinc9.txt       0.9462
cinc2.txt       0.9901          cinc5.txt       0.9340
cinc6.txt       0.9793          cinc2.txt       0.9329
cinc10.txt      0.9768          cinc4.txt       0.9300
cinc1.txt       0.9755          cinc7.txt       0.9282
cinc8.txt       0.9751          cinc10.txt      0.9141
cinc9.txt       0.9731          cinc8.txt       0.9136
cinc5.txt       0.9702          cinc6.txt       0.9134
minn4.txt       0.9277          dotm18.txt      0.6994
minn6.txt       0.9222          dotm1.txt       0.6881
minn7.txt       0.9183          dotm9.txt       0.6853

Comparisons
Average Scores (with initializing k-means)
Method                                       Purity    Inverse Purity
k-means on Random Subspace                   0.4566    0.5940
k-means on Histogram                         1.0000    1.0000
Spectral Clustering on Histogram             0.8197    0.9672
k-means on Topics ($V^T$)                    0.9920    0.9969
Spectral Clustering on Topics ($V^T$)        0.9962    0.9974
k-means on Gram re-weighting ($GV^T$)        0.9950    0.9953
k-means on Affinity re-weighting ($AV^T$)    0.9892    0.9924
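The slides do not define the two scores, so the sketch below assumes the standard clustering definitions: purity is the fraction of documents that fall in the majority true class of their cluster, and inverse purity is the same quantity with clusters and classes swapped. The labels and cluster assignments are hypothetical.

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of documents in the majority true class of their cluster."""
    by_cluster = {}
    for c, l in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[l] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(labels)

def inverse_purity(clusters, labels):
    """Purity with the roles of clusters and true classes swapped."""
    return purity(labels, clusters)

# Hypothetical example: the "reds" class is split across two clusters.
labels = ["reds", "reds", "reds", "twins", "twins", "ww2"]
clusters = [0, 0, 1, 2, 2, 3]

p = purity(clusters, labels)
ip = inverse_purity(clusters, labels)
```

Here every cluster contains a single class, so purity is perfect, but inverse purity drops because one class is split. Over-splitting therefore raises purity on its own, which is one reason to report both numbers.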

Applying Semi-Supervised Search Methods to Sidewinder Documents
Applying k-means to search results
  Only show documents in the same cluster as the search document.
Initializing U
  Replace initial column vectors of U with histogram vectors of documents.
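Both ideas are simple to express in code. In this sketch `init_U_from_documents` and `filter_by_cluster` are hypothetical helper names, the choice of seed documents and the cluster assignments are made up for illustration, and the small `eps` offset is an assumption added so that a multiplicative NMF solver could still change entries that start at zero.

```python
import numpy as np

def init_U_from_documents(X, seed_docs, eps=1e-9):
    """Build an initial U whose columns are histogram vectors of chosen documents.

    X is words x documents; seed_docs lists one document index per topic.
    """
    return X[:, seed_docs].astype(float) + eps

def filter_by_cluster(ranked, cluster_of, query):
    """Keep only ranked results in the same cluster as the query document."""
    return [name for name in ranked if cluster_of[name] == cluster_of[query]]

# Hypothetical 3-word x 3-document histogram matrix; documents 0 and 1 seed two topics.
X = np.array([[3, 0, 2],
              [1, 0, 0],
              [0, 4, 1]])
U0 = init_U_from_documents(X, seed_docs=[0, 1])

# Hypothetical ranking and k-means cluster labels.
ranked = ["cinc3.txt", "cinc9.txt", "WW2 8.txt", "cinc1.txt"]
cluster_of = {"cinc3.txt": 0, "cinc9.txt": 0, "WW2 8.txt": 1, "cinc1.txt": 0}
shown = filter_by_cluster(ranked, cluster_of, "cinc3.txt")
```

The filter preserves the similarity order of the ranking and only drops results from other clusters.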

Topics with/without Initialization
Topics Before Initializing U
Topic 1: hit, innings, runs, twins, game, second, three
Topic 2: blue jays, jays, toronto, hit, time, second, good
Topic 3: heart, blood, vein, veins, artery, arteries, motion
Topic 4: truth, nature, reason, god, will, objects, thought
Topic 5: invasion, allied, june, german, troops, normandy, british
Topics After Initializing U
Topic 1: blue, jays, toronto, hit, second, time, good
Topic 2: twins, runs, innings, game, inning, minnesota, three
Topic 3: heart, blood, vein, artery, veins, arteries, nature
Topic 4: reds, hit, season, second, cincinnati, pirates, innings
Topic 5: invasion, allied, june, german, troops, normandy, british

Future Directions
Issues with Purity
  Necessary but not sufficient
  Find a way to quantitatively evaluate the search results directly
Analysis of our new weighting methods
  Explore mathematics behind the consistency of methods despite difference in number of topics
Submit to mathematical journal

Questions?

Acknowledgements
Dr. Blake Hunter and Dr. Theodore Kolokolnikov for useful advice and guidance
Arjuna Flenner and China Lake Navy Research Lab
UCLA Department of Mathematics, BUGS, PIC Lab and Dr. Bertozzi for organizing a great research program

The End