Graph-of-word and TW-IDF: New Approach to Ad Hoc IR (CIKM 2013)
Learning to Rank: From Pairwise Approach to Listwise Approach (ICML 2007)
Qin Huazheng, 2014/10/15
Outline Motivation Framework Experiments
Motivation Questioning the term independence assumption behind the traditional bag-of-word model, we propose a different representation of a document that captures the relationships between the terms using an unweighted directed graph of terms.
Motivation Semantically speaking, word order and word dependence do matter. "Mary is quicker than John" and "John is quicker than Mary" are clearly different phrases, yet their bag-of-word representations are the same. The impact of term order has been a popular research issue, and relationships between terms in general are claimed to play an important role in text processing. This motivated us to find a representation that would capture these relationships while being as efficient as the traditional one at query time.
Outline Motivation Framework Experiments
Graph-of-word Framework We represent a textual document (typically a Web page) as a graph-of-word: an unweighted directed graph whose vertices represent unique terms, whose edges represent co-occurrences between the terms within a fixed-size sliding window, and whose edge direction represents term order.
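The construction above can be sketched in a few lines of Python (an illustrative sketch, not the authors' code; the tokenization and window size are simplifying assumptions):

```python
from collections import defaultdict

def graph_of_word(terms, window=3):
    """Build an unweighted directed graph-of-word.

    Vertices are unique terms; a directed edge u -> v is added when v
    follows u within a sliding window of `window` terms, so the edge
    direction encodes term order.
    """
    edges = set()
    for i, u in enumerate(terms):
        for v in terms[i + 1 : i + window]:
            if u != v:
                edges.add((u, v))
    graph = defaultdict(set)          # adjacency list
    for u, v in edges:
        graph[u].add(v)
    return graph

doc = "information retrieval models rank documents by information need".split()
g = graph_of_word(doc, window=3)      # g["information"] contains "retrieval"
```

Because edges form a set, repeated co-occurrences of the same ordered term pair are stored only once, matching the unweighted-graph definition.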
Graph-of-word Example: Framework
TF-IDF and TW-IDF
TF-IDF(t, d) = tf(t, d) × log(N / df(t))
TW-IDF(t, d) = tw(t, d) × log(N / df(t))
where tf(t, d) is the frequency of term t in document d, N is the number of documents in the collection, df(t) is the number of documents containing t, and tw(t, d) is the weight of the vertex associated with the term t in the graph-of-word representation of the document d (the paper additionally applies document-length normalization, omitted here). In the experiments of Section 6 of the paper, the weight is the indegree.
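A minimal sketch of TW-IDF scoring with indegree weights (simplified to the plain tw × idf form, without the paper's document-length normalization; all names and the toy numbers are illustrative):

```python
import math
from collections import Counter

def indegree_weights(edges):
    """tw(t, d): the indegree of each vertex in the graph-of-word."""
    return Counter(v for _, v in edges)

def tw_idf(tw, df, n_docs):
    """TW-IDF(t, d) = tw(t, d) * log(N / df(t))  (no length normalization)."""
    return tw * math.log(n_docs / df)

# toy graph-of-word edges for one document
edges = {("graph", "of"), ("of", "word"), ("graph", "word")}
tw = indegree_weights(edges)                      # tw["word"] == 2
score = tw_idf(tw["word"], df=10, n_docs=1000)
```

Replacing `tw` with a term-frequency count would give the analogous TF-IDF score, which is what makes TW-IDF a drop-in replacement at query time.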
Outline Motivation Framework Experiments
Experiments
Thank you
Learning to Rank: From Pairwise Approach to Listwise Approach (ICML 2007)
Outline What is Learning to Rank? (L2R) Three Approaches of L2R The Listwise Approach: ListNet
What is learning to rank? Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, to the construction of ranking models for information retrieval systems. Training data consists of lists of items with some partial order specified between the items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") to each item. The ranking model's purpose is to rank, i.e., to produce a permutation of the items in new, unseen lists that is "similar", in some sense, to the rankings in the training data.
What is learning to rank? (figure: the learning-to-rank framework)
Outline What is Learning to Rank? (L2R) Three Approaches of L2R The Listwise Approach: ListNet
Three approaches of L2R Pointwise Approach In this case it is assumed that each query-document pair in the training data has a numerical or ordinal score. The learning-to-rank problem can then be approximated by a regression problem: given a single query-document pair, predict its score. This approach does not consider the relationships between documents.
Three approaches of L2R Pairwise Approach In this case the learning-to-rank problem is approximated by a classification problem: learning a binary classifier that can tell which document is better in a given pair of documents. The goal is to minimize the average number of inversions in the ranking.
Three approaches of L2R Pairwise Approach The objective of learning is formalized as minimizing errors in the classification of document pairs, rather than minimizing errors in the ranking of documents. The assumption that the document pairs are generated i.i.d. is also too strong. Moreover, the number of generated document pairs varies largely from query to query, which results in training a model biased toward queries with more document pairs.
Three approaches of L2R Listwise Approach In the listwise approach, the document lists instead of document pairs are used as instances in learning.
Outline What is Learning to Rank? (L2R) Three Approaches of L2R The Listwise Approach: ListNet
The major question then is how to define a listwise loss function, representing the difference between the ranking list output by a ranking model and the ranking list given as ground truth. For example:
T_i = {(d_1, s_1), (d_2, s_2), (d_3, s_3), (d_4, s_4), (d_5, s_5)}
O_i = {(d_2, s_2), (d_1, s_1), (d_3, s_3), (d_5, s_5), (d_4, s_4)}
Notations A set of queries: Q = {q^(1), q^(2), ..., q^(m)}. Each query q^(i) is associated with a list of documents d^(i) = (d^(i)_1, d^(i)_2, ..., d^(i)_{n^(i)}), where n^(i) denotes the size of d^(i). Each list of documents is associated with a list of judgments (scores): y^(i) = (y^(i)_1, y^(i)_2, ..., y^(i)_{n^(i)}).
Notations A feature vector x^(i)_j is created from each query-document pair (q^(i), d^(i)_j). The list of feature vectors: x^(i) = (x^(i)_1, ..., x^(i)_{n^(i)}). The corresponding list of scores: y^(i) = (y^(i)_1, ..., y^(i)_{n^(i)}). The training set: T = {(x^(i), y^(i))}, i = 1, ..., m.
Notations Ranking function: f, which assigns a score f(x^(i)_j) to each feature vector. The scores of a list of feature vectors: z^(i) = (f(x^(i)_1), ..., f(x^(i)_{n^(i)})). The total loss with respect to the training data: Σ^m_{i=1} L(y^(i), z^(i)), where L is a listwise loss function.
The Loss Function How about defining the loss function as MAP or NDCG? Namely:
L(y^(i), z^(i)) = 1 - MAP(y^(i), z^(i))
or:
L(y^(i), z^(i)) = 1 - NDCG(y^(i), z^(i))
MAP (Mean Average Precision) AP = (1/M) Σ^m_{i=1} P(i) · rel(i), where P(i) is the proportion of accepted (relevant) items among the top i items, rel(i) is 1 if the item at rank i is accepted and 0 otherwise, and M is the total number of accepted items among all top-m items.
MAP (Mean Average Precision) AP example: predicted ranking z^(i) = (A, C, D, B, E), accepted items (from y_i): {A, B, E}, so M = 3:
AP = (1/3) · (1/1 · 1 + 1/2 · 0 + 1/3 · 0 + 2/4 · 1 + 3/5 · 1) = 0.7
MAP (Mean Average Precision) MAP = (1/n) Σ^n_{i=1} AP(y_i, z^(i)), the mean of AP over all n queries (z^(1), y_1), ..., (z^(n), y_n).
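The AP and MAP computations can be sketched directly (an illustrative sketch; function names are my own, and the example ranking follows the slide's AP example):

```python
def average_precision(ranked, relevant):
    """AP for one ranked list; `relevant` is the set of accepted items."""
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / i   # P(i), counted only at relevant ranks
    return precision_sum / len(relevant)

def mean_average_precision(rankings, relevant_sets):
    """MAP: the mean of AP over all queries."""
    aps = [average_precision(r, s) for r, s in zip(rankings, relevant_sets)]
    return sum(aps) / len(aps)

# the slide's example: predicted order (A, C, D, B, E), relevant {A, B, E}
ap = average_precision(["A", "C", "D", "B", "E"], {"A", "B", "E"})
# (1/1 + 2/4 + 3/5) / 3 = 0.7
```

Note that precision is accumulated only at the ranks holding relevant items, which is why the zero terms in the slide's expansion drop out.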
NDCG (Normalized Discounted Cumulative Gain)
DCG = Σ^m_{i=1} (2^{r(i)} - 1) / log2(i + 1), where r(i) is the relevance grade of the item at rank i.
NDCG = DCG / Z, where Z is the DCG of the ideal (ground-truth) ranking, so that the ideal ranking has NDCG = 1.
Example (binary relevance, predicted ranking z^(i) = (A, C, D, B, E), accepted items {A, B, E}):
NDCG = (1/Z) · [1/log2(1+1) + 0/log2(2+1) + 0/log2(3+1) + 1/log2(4+1) + 1/log2(5+1)]
Z = 1/log2(1+1) + 1/log2(2+1) + 1/log2(3+1)
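NDCG can likewise be sketched in a few lines (an illustrative sketch; the gain vector below assumes binary relevance with the same ranking as the AP example):

```python
import math

def dcg(gains):
    """DCG = sum_i (2^r(i) - 1) / log2(i + 1), ranks starting at 1."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """Normalize by Z, the DCG of the ideal (sorted-descending) ordering."""
    z = dcg(sorted(gains, reverse=True))
    return dcg(gains) / z if z > 0 else 0.0

# binary relevance: predicted order (A, C, D, B, E) with accepted {A, B, E}
# puts relevant items at ranks 1, 4 and 5, i.e. the gain vector below
score = ndcg([1, 0, 0, 1, 1])
```

The `2^r - 1` gain reduces to 0/1 for binary relevance, so graded and binary judgments are handled by the same formula.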
The Loss Function Why not directly optimize the following?
Σ^n_{i=1} L(y_i, z^(i)) = Σ^n_{i=1} [1 - MAP(y_i, z^(i))]
Σ^n_{i=1} L(y_i, z^(i)) = Σ^n_{i=1} [1 - NDCG(y_i, z^(i))]
They are not continuous with respect to the model's scores! Many optimization methods (e.g. gradient descent, Lagrange multipliers) cannot be applied.
Probability Models We propose using two probability models to calculate the listwise loss function. We map a list of scores to a probability distribution using one of the two probability models and then take any metric between probability distributions as a loss function. The two models are referred to as permutation probability and top k probability.
Probability Models (figure: each of the two lists of scores is mapped to a probability distribution; the loss is the difference between the two distributions)
Permutation Probability A permutation π = ⟨π(1), π(2), ..., π(n)⟩ of the n objects, where π(j) denotes the object at position j. We use s = (s_1, s_2, ..., s_n) to denote the list of scores, where s_j is the score of the j-th object.
Permutation Probability The probability of permutation π given the list of scores s is defined as:
P_s(π) = Π^n_{j=1} [ φ(s_{π(j)}) / Σ^n_{k=j} φ(s_{π(k)}) ]
where s_{π(j)} denotes the score of the object at position j of permutation π, and φ(·) is an increasing and strictly positive function (e.g. the exponential function).
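The definition P_s(π) = Π_j φ(s_{π(j)}) / Σ_{k≥j} φ(s_{π(k)}) translates directly into code (an illustrative sketch with φ = exp; the toy scores are my own):

```python
import math

def permutation_probability(scores, perm, phi=math.exp):
    """P_s(pi): product over positions j of phi(s_pi(j)) divided by the
    sum of phi over the objects not yet placed (positions j..n).

    perm is a permutation given as object indices, top-ranked first.
    """
    vals = [phi(scores[j]) for j in perm]
    prob = 1.0
    for j in range(len(vals)):
        prob *= vals[j] / sum(vals[j:])
    return prob

s = [3.0, 1.0, 2.0]                              # scores of objects 0, 1, 2
p_desc = permutation_probability(s, [0, 2, 1])   # descending-score order
p_asc = permutation_probability(s, [1, 2, 0])    # ascending-score order
```

Summed over all n! permutations these probabilities form a distribution, and the descending-score order receives the largest probability, which is exactly what Theorems 3 and 4 below state.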
Permutation Probability Theorem 3 indicates that, for a permutation in which an object with a larger score is ranked ahead of another object with a smaller score, if we exchange their positions, the permutation probability of the resulting permutation will be lower than that of the original permutation.
Permutation Probability Theorem 4 indicates given the scores of n objects, the list of objects sorted in descending order of the scores has the highest permutation probability, while the list of objects sorted in ascending order has the lowest permutation probability.
Permutation Probability Given two lists of scores, we can first calculate the two corresponding permutation probability distributions, and then take the metric between the two distributions as the listwise loss function. Since the number of permutations is of order O(n!), the calculation might be intractable in practice. To cope with the problem, we consider the use of top k probability.
Top-k Probability Before giving the definition of the top-k probability, we first define the top-k subgroup: G_k(j_1, j_2, ..., j_k) is the set of all permutations in which the top k positions are occupied by objects j_1, j_2, ..., j_k, in that order. There are A^k_n = n!/(n - k)! different top-k subgroups.
Top-k Probability The top-k probability of objects (j_1, ..., j_k) is the sum of the permutation probabilities of all permutations in the subgroup, which has the closed form:
P_s(G_k(j_1, ..., j_k)) = Σ_{π ∈ G_k(j_1, ..., j_k)} P_s(π) = Π^k_{t=1} [ φ(s_{j_t}) / Σ^n_{l=t} φ(s_{j_l}) ]
where the denominator at step t sums φ over all objects not yet placed above position t. It can therefore be computed without enumerating the (n - k)! permutations in the subgroup.
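The top-k probability P_s(G_k(j_1, ..., j_k)) = Π_{t=1..k} φ(s_{j_t}) / Σ_{l≥t} φ(s_{j_l}) can be computed without enumerating permutations (an illustrative sketch with φ = exp; names and toy scores are my own):

```python
import math

def top_k_probability(scores, top, phi=math.exp):
    """P_s(G_k(j_1..j_k)): at each step t, divide phi(s_{j_t}) by the sum
    of phi over all objects not yet placed, then remove the placed object."""
    vals = [phi(s) for s in scores]
    remaining = sum(vals)
    prob = 1.0
    for j in top:
        prob *= vals[j] / remaining
        remaining -= vals[j]
    return prob

s = [3.0, 1.0, 2.0]
p_top1 = top_k_probability(s, [0])   # top-1 probability of object 0
```

For k = 1 this reduces to a softmax over the scores, which is the special case ListNet uses in its experiments.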
The Listwise Loss Function Given two lists of scores, we can define the metric between the corresponding top-k probability distributions as the listwise loss function. For example, when we use Cross Entropy as the metric, the listwise loss function is:
L(y^(i), z^(i)) = - Σ_{∀g ∈ G_k} P_{y^(i)}(g) · log P_{z^(i)}(g)
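With k = 1 the top-1 distribution of a score list is simply its softmax, so the cross-entropy listwise loss can be sketched as (an illustrative sketch; function names are my own):

```python
import math

def softmax(xs):
    m = max(xs)                          # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def listnet_loss(y_scores, z_scores):
    """Cross entropy between the two top-1 probability distributions:
    L(y, z) = - sum_j P_y(j) * log P_z(j)."""
    p_y = softmax(y_scores)
    p_z = softmax(z_scores)
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))
```

Unlike 1 - MAP or 1 - NDCG, this loss is continuous and differentiable in the model's scores, which is what makes gradient descent applicable.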
Learning Method: ListNet We propose a new learning method for optimizing the listwise loss function based on top-k probability, with a neural network as the model and gradient descent as the optimization algorithm. We refer to the method as ListNet.
Learning Method: ListNet Algorithm sketch: Input: training data {(x^(i), y^(i))}; parameters: number of iterations T and learning rate η. Initialize the parameter vector ω of the ranking model f_ω. For t = 1 to T: for each query i, compute the score list z^(i) = f_ω(x^(i)), compute the gradient Δω = ∂L(y^(i), z^(i))/∂ω of the listwise (cross-entropy) loss, and update ω ← ω - η · Δω. Output: ω. In the experiments k = 1 is used, so the top-1 probabilities reduce to a softmax over the scores.
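The update step can be sketched end to end (an illustrative sketch, not the authors' code: k = 1, a linear scoring function in place of a general neural network, and a toy one-feature query; for linear f_ω the gradient of the top-1 cross entropy is Σ_j (P_z(j) - P_y(j)) · x_j):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def listnet_step(w, features, y_scores, lr=0.1):
    """One gradient-descent update on one query (linear scorer, k = 1).

    features -- feature vectors x_j of the query's documents
    y_scores -- ground-truth scores y_j
    """
    z = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    p_y, p_z = softmax(y_scores), softmax(z)
    grad = [0.0] * len(w)                 # sum_j (P_z(j) - P_y(j)) * x_j
    for x, py, pz in zip(features, p_y, p_z):
        for d in range(len(w)):
            grad[d] += (pz - py) * x[d]
    return [wi - lr * g for wi, g in zip(w, grad)]

# toy run: one query, two one-feature documents; document 0 is relevant
w = [0.0]
for _ in range(200):
    w = listnet_step(w, [[1.0], [0.0]], [2.0, 0.0], lr=0.5)
```

After training, the weight is positive, so the model scores the relevant document higher; with this toy data the fixed point is w = 2, where softmax([w, 0]) matches softmax([2, 0]).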
Experimental Results Data collections: TREC, OHSUMED, CSearch. Ranking accuracy:
Experimental Results (figure: NDCG curves)
Experimental Results (figure: the pairwise loss during training). Similar trends were observed in the results evaluated in terms of MAP.
Thank you