Similarity & Link Analysis. Stony Brook University CSE545, Fall 2016

1 Similarity & Link Analysis
Stony Brook University CSE545, Fall 2016

2 Finding Similar Items? ( ecommendation-system-of-hive/) ( 3/8/entity-resolution-for-big-data)

3 Finding Similar Items: What we will cover
Shingling
Minhashing
Locality-sensitive hashing
Distance Metrics

4 Document Similarity
Challenge: How to represent the document in a way that can be efficiently encoded and compared?

5 Shingles
Goal: Convert documents to sets

6 Shingles
Goal: Convert documents to sets
k-shingles (aka "character n-grams"): sequences of k characters
E.g. k=2, doc="abcdabd": shingles(doc, 2) = {ab, bc, cd, da, bd}

7 Shingles
Goal: Convert documents to sets
k-shingles (aka "character n-grams"): sequences of k characters
E.g. k=2, doc="abcdabd": shingles(doc, 2) = {ab, bc, cd, da, bd}
Similar documents have many common shingles.
Changing words or order has minimal effect.
In practice use 5 < k < 10

8 Shingles
Goal: Convert documents to sets
k-shingles (aka "character n-grams"): sequences of k characters
E.g. k=2, doc="abcdabd": shingles(doc, 2) = {ab, bc, cd, da, bd}
Similar documents have many common shingles.
Changing words or order has minimal effect.
In practice use 5 < k < 10: large enough that any given shingle appearing in a document is highly unlikely (e.g. a chance well below 1%).
Can hash large shingles to smaller values (e.g. 9-shingles into 4 bytes).
Can also use words (aka n-grams).
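The shingling step can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function names are made up here, and CRC32 is only one convenient assumption for squeezing a shingle into 4 bytes.

```python
import zlib


def shingles(doc: str, k: int) -> set[str]:
    """Return the set of character k-shingles of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def hashed_shingles(doc: str, k: int) -> set[int]:
    """Map each k-shingle to a 4-byte integer (CRC32 is an assumed choice)."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(doc, k)}


print(shingles("abcdabd", 2))  # the five 2-shingles from the slide, in some order
```

Because the result is a set, the repeated shingle "ab" is stored only once.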

9 Shingles Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).

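10 Minhashing
Goal: Convert sets to shorter ids, signatures

11 Minhashing - Background
Goal: Convert sets to shorter ids, signatures
Characteristic Matrix, X: one row per shingle, one column per set (S1, S2); often very sparse! (lots of zeros) (Leskovec et al., 2014)
Jaccard Similarity: sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|

12 Minhashing - Background
Characteristic Matrix: columns S1, S2; rows ab, bc, de, ah, ha, ed, ca
Jaccard Similarity:

13 Minhashing - Background
Characteristic Matrix: columns S1, S2; rows marked ** when both sets contain the shingle and * when exactly one does:
ab **
bc *
de *
ah **
ha
ed **
ca *
Jaccard Similarity:

14 Minhashing - Background
Characteristic Matrix: columns S1, S2; rows marked ** when both sets contain the shingle and * when exactly one does:
ab **
bc *
de *
ah **
ha
ed **
ca *
Jaccard Similarity: sim(S1, S2) = 3 / 6 = (# both have) / (# at least one has)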
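As a small sketch of the Jaccard computation: the slide only tells us which shingles are shared (ab, ah, ed) and which belong to exactly one set (bc, de, ca), so the assignment of the unshared shingles to S1 or S2 below is an assumption made purely for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(union)


# Assumed split of the unshared shingles between the two sets.
s1 = {"ab", "ah", "ed", "bc"}
s2 = {"ab", "ah", "ed", "de", "ca"}

print(jaccard(s1, s2))  # 3 shared / 6 total = 0.5
```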

15 Shingles Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).

16 Minhashing
Approximate Approach:
1) Instead of keeping the whole characteristic matrix, just keep the first row where a 1 is encountered.
2) Shuffle and repeat to get a "signature" for each set.
Idea: We don't need to actually shuffle; we can just use hash functions.
Characteristic Matrix, X: columns S1, S2, S3, S4; rows ab, bc, de, ah, ha, ed, ca (Leskovec et al., 2014)

17 Minhashing
Characteristic Matrix: columns S1, S2, S3, S4; rows ab, bc, de, ah, ha, ed, ca (Leskovec et al., 2014)
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps each set to the first row where the set appears.

18 Minhashing
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps each set to the first row where the set appears.
Characteristic Matrix: columns S1, S2, S3, S4; rows ab, bc, de, ah, ha, ed, ca (Leskovec et al., 2014)
Permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de

19 Minhashing
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps each set to the first row where the set appears.
Characteristic Matrix: columns S1, S2, S3, S4; each row labeled with its position in the permuted order: 3 ab, 4 bc, 7 de, 6 ah, 1 ha, 2 ed, 5 ca (Leskovec et al., 2014)
Permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de

20 Minhashing
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps each set to the first row where the set appears.
Characteristic Matrix: columns S1, S2, S3, S4; each row labeled with its position in the permuted order: 3 ab, 4 bc, 7 de, 6 ah, 1 ha, 2 ed, 5 ca (Leskovec et al., 2014)
Permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de
h(S1) = ed  # permuted row 2
h(S2) = ha  # permuted row 1
h(S3) =

21 Minhashing
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps each set to the first row where the set appears.
Characteristic Matrix: columns S1, S2, S3, S4; each row labeled with its position in the permuted order: 3 ab, 4 bc, 7 de, 6 ah, 1 ha, 2 ed, 5 ca (Leskovec et al., 2014)
Permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de
h(S1) = ed  # permuted row 2
h(S2) = ha  # permuted row 1
h(S3) = ed  # permuted row 2
h(S4) =

22 Minhashing
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps each set to the first row where the set appears.
Characteristic Matrix: columns S1, S2, S3, S4; each row labeled with its position in the permuted order: 3 ab, 4 bc, 7 de, 6 ah, 1 ha, 2 ed, 5 ca (Leskovec et al., 2014)
Permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de
h(S1) = ed  # permuted row 2
h(S2) = ha  # permuted row 1
h(S3) = ed  # permuted row 2
h(S4) = ha  # permuted row 1

23 Minhashing
Characteristic Matrix: columns S1, S2, S3, S4; each row labeled with its position in the permuted order: 3 ab, 4 bc, 7 de, 6 ah, 1 ha, 2 ed, 5 ca (Leskovec et al., 2014)
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps sets to rows.
Signature matrix M: Record the first row where each set had a 1 in the given permutation (one row of M per minhash function h; columns S1, S2, S3, S4).
h(S1) = ed  # permuted row 2
h(S2) = ha  # permuted row 1

26 Minhashing
[Figure: characteristic matrix (columns S1-S4; rows ab, bc, de, ah, ha, ed, ca) annotated with two row permutations] (Leskovec et al., 2014)
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps sets to rows.
Signature matrix M: Record the first row where each set had a 1 in the given permutation.
[Signature matrix M: rows h1, h2; columns S1, S2, S3, S4]

27 Minhashing
[Figure: characteristic matrix (columns S1-S4; rows ab, bc, de, ah, ha, ed, ca) annotated with two row permutations] (Leskovec et al., 2014)
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps sets to rows.
Signature matrix M: Record the first row where each set had a 1 in the given permutation.
[Signature matrix M: rows h1, h2 filled in; columns S1, S2, S3, S4]

28 Minhashing
[Figure: characteristic matrix (columns S1-S4; rows ab, bc, de, ah, ha, ed, ca) annotated with three row permutations] (Leskovec et al., 2014)
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps sets to rows.
Signature matrix M: Record the first row where each set had a 1 in the given permutation.
[Signature matrix M: rows h1, h2, h3; columns S1, S2, S3, S4]

30 Minhashing
[Figure: characteristic matrix X (columns S1-S4; rows ab, bc, de, ah, ha, ed, ca) annotated with three row permutations] (Leskovec et al., 2014)
Minhash function h: Based on a permutation of the rows in the characteristic matrix, h maps sets to rows.
Signature matrix M: Record the first row where each set had a 1 in the given permutation.
[Signature matrix M: rows h1, h2, h3; columns S1, S2, S3, S4]

31 Minhashing
[Figure: characteristic matrix and signature matrix M as on slide 30] (Leskovec et al., 2014)
Property of signature matrix: The probability for any hi (i.e. any row) that hi(S1) = hi(S2) is the same as Sim(S1, S2).

32 Minhashing
[Figure: characteristic matrix and signature matrix M as on slide 30] (Leskovec et al., 2014)
Property of signature matrix: The probability for any hi (i.e. any row) that hi(S1) = hi(S2) is the same as Sim(S1, S2).
Thus, the similarity of the signatures of S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree.

33 Minhashing
[Figure: characteristic matrix and signature matrix M as on slide 30] (Leskovec et al., 2014)
Property of signature matrix: The probability for any hi (i.e. any row) that hi(S1) = hi(S2) is the same as Sim(S1, S2).
Thus, the similarity of the signatures of S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree.
Estimate with a random sample of permutations (i.e. ~100).

34 Minhashing
[Figure: characteristic matrix and signature matrix M as on slide 30] (Leskovec et al., 2014)
Property of signature matrix: The probability for any hi (i.e. any row) that hi(S1) = hi(S2) is the same as Sim(S1, S2).
Thus, the similarity of the signatures of S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree.
Estimate with a random sample of permutations (i.e. ~100).
Estimated Sim(S1, S3) = agree / all = 2/3

35 Minhashing
[Figure: characteristic matrix and signature matrix M as on slide 30] (Leskovec et al., 2014)
Property of signature matrix: The probability for any hi (i.e. any row) that hi(S1) = hi(S2) is the same as Sim(S1, S2).
Thus, the similarity of the signatures of S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree.
Estimated Sim(S1, S3) = agree / all = 2/3
Real Sim(S1, S3) = Type a / (a + b + c) = 3/4 (type a rows: both sets have a 1; type b and c rows: only one set has a 1)

36 Minhashing
[Figure: characteristic matrix and signature matrix M as on slide 30] (Leskovec et al., 2014)
Property of signature matrix: The probability for any hi (i.e. any row) that hi(S1) = hi(S2) is the same as Sim(S1, S2).
Thus, the similarity of the signatures of S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree.
Estimated Sim(S1, S3) = agree / all = 2/3
Real Sim(S1, S3) = Type a / (a + b + c) = 3/4 (type a rows: both sets have a 1; type b and c rows: only one set has a 1)
Try Sim(S2, S4) and Sim(S1, S2)
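The permutation-based procedure can be sketched as follows. The set contents and the second and third permutations are assumptions chosen to be consistent with the values quoted on the slides (h(S1) = ed, h(S2) = ha, estimated 2/3, real 3/4); only the first permutation is taken directly from the slide text.

```python
ROWS = ["ab", "bc", "de", "ah", "ha", "ed", "ca"]

# Assumed set contents; the matrix cells are not given in the slide text.
SETS = {
    "S1": {"ab", "bc", "ed", "ca"},
    "S2": {"de", "ah", "ha"},
    "S3": {"ab", "ed", "ca"},
    "S4": {"bc", "de", "ah", "ha"},
}

# Each permutation lists the new 1-based position of every row in ROWS.
PERMUTATIONS = [
    [3, 4, 7, 6, 1, 2, 5],  # permuted order from the slides
    [4, 2, 1, 3, 6, 7, 5],  # assumed for illustration
    [1, 3, 2, 4, 5, 6, 7],  # assumed for illustration
]


def minhash(members: set, permutation: list) -> int:
    """Position of the first permuted row that the set contains."""
    return min(pos for row, pos in zip(ROWS, permutation) if row in members)


def signature(members: set) -> list:
    """One minhash value per permutation (a column of the signature matrix)."""
    return [minhash(members, p) for p in PERMUTATIONS]


def estimated_sim(sig_a: list, sig_b: list) -> float:
    """Fraction of minhash functions on which two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


sig = {name: signature(members) for name, members in SETS.items()}
print(estimated_sim(sig["S1"], sig["S3"]), jaccard(SETS["S1"], SETS["S3"]))
```

With only three permutations the estimate is coarse; it tightens as more permutations (or hash functions) are sampled.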

37 Minhashing In Practice
Problem:
Can't reasonably do permutations (huge space).
Can't randomly grab rows according to an order (random disk seeks = slow!).

38 Minhashing In Practice
Problem:
Can't reasonably do permutations (huge space).
Can't randomly grab rows according to an order (random disk seeks = slow!).
Solution: Use "random" hash functions.
Setup:
Pick ~100 hash functions, hashes.
Store M[i][s] = a potential minimum hi(r)  # initialized to infinity (num hashes x num sets)

39 Minhashing
Solution: Use "random" hash functions.
Setup:
Pick ~100 hash functions, hashes.
Store M[i][s] = a potential minimum hi(r)  # initialized to infinity (num hashes x num sets)
Algorithm:
for r in rows of cm:  # cm is the characteristic matrix
    compute hi(r) for all i in hashes  # precompute values
    for each set s in row r:
        if cm[r][s] == 1:
            for i in hashes:  # check which hash produces the smallest value
                if hi(r) < M[i][s]:
                    M[i][s] = hi(r)

40 Minhashing
Solution: Use "random" hash functions.
Setup:
Pick ~100 hash functions, hashes.
Store M[i][s] = a potential minimum hi(r)  # initialized to infinity (num hashes x num sets)
Algorithm:
for r in rows of cm:  # cm is the characteristic matrix
    compute hi(r) for all i in hashes  # precompute values
    for each set s in row r:
        if cm[r][s] == 1:
            for i in hashes:  # check which hash produces the smallest value
                if hi(r) < M[i][s]:
                    M[i][s] = hi(r)
Known as "efficient minhashing".
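The pseudocode above translates almost line for line into Python. The sketch below is one possible rendering: the small characteristic matrix and the two toy hash functions are illustrative assumptions, not values from the slides.

```python
import math


def minhash_signatures(cm, hash_funcs):
    """Single pass over the rows of cm, keeping the minimum hash per (hash, set)."""
    num_sets = len(cm[0])
    M = [[math.inf] * num_sets for _ in hash_funcs]
    for r, row in enumerate(cm):
        hashed = [h(r) for h in hash_funcs]  # precompute h_i(r) once per row
        for s, bit in enumerate(row):
            if bit == 1:
                for i, value in enumerate(hashed):
                    if value < M[i][s]:
                        M[i][s] = value
    return M


# Assumed 7 x 4 characteristic matrix (rows = shingles, columns = sets).
cm = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]

# Toy hash functions on row indices (assumed for illustration).
hash_funcs = [lambda r: (r + 1) % 7, lambda r: (3 * r + 1) % 7]

M = minhash_signatures(cm, hash_funcs)
print(M)
```

Each row of the matrix is read exactly once and in order, which is what avoids the random disk seeks mentioned above.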

41 Minhashing
What hash functions to use?
Start with two decent hash functions, e.g.
ha(x) = ascii(string) % large_prime_number
hb(x) = (3*ascii(string) + 6) % large_prime_number
Add together, multiplying the second by i:
hi(x) = ha(x) + i*hb(x)
e.g. h5(x) = ha(x) + 5*hb(x)
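A minimal sketch of this hash family follows. Two things are assumptions: "ascii(string)" is read here as the string's ASCII bytes interpreted as one big integer, and the prime 2^31 - 1 is simply a convenient large prime.

```python
LARGE_PRIME = 2_147_483_647  # 2**31 - 1, a known prime (assumed choice)


def ascii_value(s: str) -> int:
    """Interpret the ASCII bytes of s as one big-endian integer (assumed reading)."""
    return int.from_bytes(s.encode("ascii"), "big")


def h_a(s: str) -> int:
    return ascii_value(s) % LARGE_PRIME


def h_b(s: str) -> int:
    return (3 * ascii_value(s) + 6) % LARGE_PRIME


def h_i(s: str, i: int) -> int:
    """i-th hash function of the family: h_a(x) + i * h_b(x)."""
    return h_a(s) + i * h_b(s)


print(h_i("ab", 5))  # 24930 + 5 * 74796 = 398910
```

Only two base hashes are ever computed per shingle; every additional hash function costs one multiply and one add.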

43 Minhashing Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).

44 Minhashing
Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).
New Problem: Even if the signatures are small, it can be computationally expensive to find similar pairs.
E.g. 1m documents; 1,000,000 choose 2 ≈ 500,000,000,000 pairs

45 Locality-Sensitive Hashing
Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity).
Candidate pairs: pairs of elements to be evaluated for similarity.

46 Locality-Sensitive Hashing
Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity).
Candidate pairs: pairs of elements to be evaluated for similarity.
If we wanted the similarity for all pairs of documents, could anything be done?

47 Locality-Sensitive Hashing
Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity).
Candidate pairs: pairs of elements to be evaluated for similarity.
Approach: Hash multiple times over subsets of data: similar items are likely to land in the same bucket at least once.

48 Locality-Sensitive Hashing
Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity).
Candidate pairs: pairs of elements to be evaluated for similarity.
Approach: Hash multiple times over subsets of data: similar items are likely to land in the same bucket at least once.
Approach from MinHash: Hash columns of the signature matrix; candidate pairs end up in the same bucket.

49 Locality-Sensitive Hashing
Step 1: Add bands
(Leskovec et al., 2014)

50 Locality-Sensitive Hashing
Step 1: Add bands
Can be tuned to catch most true-positives with least false-positives.
(Leskovec et al., 2014)

51 Locality-Sensitive Hashing
Step 1: Add bands
Step 2: Hash columns within bands
(Leskovec et al., 2014)

54 Locality-Sensitive Hashing
Step 1: Add bands
Step 2: Hash columns within bands
Criteria for being a candidate pair: They end up in the same bucket for at least 1 band.
(Leskovec et al., 2014)

55 Locality-Sensitive Hashing
Step 1: Add bands
Step 2: Hash columns within bands
Simplification: There are enough buckets compared to rows per band that columns must be identical in order to hash to the same bucket. Thus, we only need to check whether columns are identical within a band.
(Leskovec et al., 2014)
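Under that simplification, banding can be sketched by using each band's slice of a signature directly as the bucket key. The function name and the tiny signatures below are hypothetical, chosen only to show the mechanics.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures: dict, bands: int, rows: int) -> set:
    """Return pairs whose signatures are identical in at least one band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # Identical band slice <=> same bucket (the simplification above).
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


# Hypothetical signatures of length 4, split into 2 bands of 2 rows.
signatures = {
    "A": [1, 2, 3, 4],
    "B": [1, 2, 9, 9],
    "C": [7, 7, 7, 7],
}
print(lsh_candidates(signatures, bands=2, rows=2))  # {('A', 'B')}
```

A and B agree on the first band only, which is enough to make them a candidate pair; C never shares a bucket with anything.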

56 Document Similarity Pipeline
Shingling -> Minhashing -> Locality-sensitive hashing

57 Realistic Example: Probabilities of agreement
100,000 documents
100 random permutations/hash functions/rows
=> if 4-byte integers then 40MB to hold the signature matrix
=> still 100k choose 2 is a lot (~5 billion)

58 Realistic Example: Probabilities of agreement
100,000 documents
100 random permutations/hash functions/rows
=> if 4-byte integers then 40MB to hold the signature matrix
=> still 100k choose 2 is a lot (~5 billion)
20 bands of 5 rows
Want 80% Jaccard Similarity; for any row p(S1 == S2) = .8

59 Realistic Example: Probabilities of agreement
100,000 documents
100 random permutations/hash functions/rows
=> if 4-byte integers then 40MB to hold the signature matrix
=> still 100k choose 2 is a lot (~5 billion)
20 bands of 5 rows
Want 80% Jaccard Similarity; for any row p(S1 == S2) = .8
P(S1 == S2 | b): probability S1 and S2 agree within a given band

60 Realistic Example: Probabilities of agreement
100,000 documents
100 random permutations/hash functions/rows
=> if 4-byte integers then 40MB to hold the signature matrix
=> still 100k choose 2 is a lot (~5 billion)
20 bands of 5 rows
Want 80% Jaccard Similarity; for any row p(S1 == S2) = .8
P(S1 == S2 | b): probability S1 and S2 agree within a given band
= .8^5 = .328 => P(S1 != S2 | b) = 1 - .328 = .672
P(S1 != S2): probability S1 and S2 do not agree in any band

61 Realistic Example: Probabilities of agreement
100,000 documents
100 random permutations/hash functions/rows
=> if 4-byte integers then 40MB to hold the signature matrix
=> still 100k choose 2 is a lot (~5 billion)
20 bands of 5 rows
Want 80% Jaccard Similarity; for any row p(S1 == S2) = .8
P(S1 == S2 | b): probability S1 and S2 agree within a given band
= .8^5 = .328 => P(S1 != S2 | b) = 1 - .328 = .672
P(S1 != S2): probability S1 and S2 do not agree in any band
= .672^20 = .00035

62 Realistic Example: Probabilities of agreement
100,000 documents
100 random permutations/hash functions/rows
=> if 4-byte integers then 40MB to hold the signature matrix
=> still 100k choose 2 is a lot (~5 billion)
20 bands of 5 rows
Want 80% Jaccard Similarity; for any row p(S1 == S2) = .8
P(S1 == S2 | b): probability S1 and S2 agree within a given band
= .8^5 = .328 => P(S1 != S2 | b) = 1 - .328 = .672
P(S1 != S2): probability S1 and S2 do not agree in any band
= .672^20 = .00035
What if wanting 40% Jaccard Similarity?
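The arithmetic on this slide, and the closing question about 40% similarity, can be checked with a short sketch. The function names are illustrative; the formula is the one used above, 1 - (1 - s^rows)^bands.

```python
def p_band_agrees(s: float, rows: int) -> float:
    """Probability that two signatures agree on every row of one band."""
    return s ** rows


def p_missed(s: float, bands: int, rows: int) -> float:
    """Probability that two signatures agree in no band at all."""
    return (1 - p_band_agrees(s, rows)) ** bands


def p_candidate(s: float, bands: int, rows: int) -> float:
    """Probability that the pair becomes a candidate (agrees in >= 1 band)."""
    return 1 - p_missed(s, bands, rows)


for s in (0.8, 0.4):
    print(s, round(p_band_agrees(s, 5), 5), round(p_candidate(s, 20, 5), 5))
```

With 20 bands of 5 rows, a pair at 80% similarity is missed roughly 0.035% of the time, while a pair at 40% similarity becomes a candidate a little under 19% of the time.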

63 Distance Metrics
Pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim).

64 Distance Metrics
Pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim).
Typical properties of a distance metric, d:
d(x, x) = 0
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y)

65 Distance Metrics
Pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim).
There are other metrics of similarity, e.g.:
Euclidean Distance
Cosine Distance
Edit Distance
Hamming Distance

66 Distance Metrics
Pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim).
There are other metrics of similarity, e.g.:
Euclidean Distance
Cosine Distance
Edit Distance
Hamming Distance
("L Norm")
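For reference, the listed distances can be sketched in plain Python. These are common textbook definitions rather than the slides' own; in particular, cosine distance is taken here as 1 minus cosine similarity (some texts use the angle instead, and the 1 - cos form does not satisfy the triangle inequality), and edit distance is the Levenshtein variant that allows substitutions.

```python
import math


def jaccard_distance(a: set, b: set) -> float:
    """1 - Jaccard similarity; assumes at least one set is non-empty."""
    return 1 - len(a & b) / len(a | b)


def euclidean_distance(x, y) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y) -> float:
    """1 - cosine similarity (assumed definition); vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


def hamming_distance(x, y) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


print(euclidean_distance((0, 0), (3, 4)), edit_distance("kitten", "sitting"))
```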

68 Locality Sensitive Hashing - Theory
LSH can be generalized to many distance metrics by converting output to a probability and providing a lower bound on the probability of being similar.

69 Locality Sensitive Hashing - Theory
LSH can be generalized to many distance metrics by converting output to a probability and providing a lower bound on the probability of being similar.
E.g. for Euclidean distance:
Choose random lines (analogous to hash functions in minhashing).
Project the two points onto each line; match if the two points fall within the same interval.
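The projection idea can be sketched as below. In practice the line directions would be drawn at random; here they are fixed so the output is reproducible, and the interval width and function names are assumptions for illustration.

```python
import math


def line_bucket(point, direction, width: float, offset: float = 0.0) -> int:
    """Project point onto the line along direction and return its interval index."""
    norm = math.sqrt(sum(d * d for d in direction))
    projection = sum(p * d for p, d in zip(point, direction)) / norm
    return math.floor((projection + offset) / width)


def is_candidate(p, q, directions, width: float) -> bool:
    """Two points match if they share an interval on at least one line."""
    return any(
        line_bucket(p, d, width) == line_bucket(q, d, width) for d in directions
    )


# Fixed directions standing in for randomly chosen lines (assumption).
directions = [(1.0, 0.0), (0.0, 1.0)]
print(is_candidate((0.2, 5.0), (0.7, -3.0), directions, width=1.0))  # True
print(is_candidate((0.2, 5.0), (3.5, -3.0), directions, width=1.0))  # False
```

Nearby points can still be split by an interval boundary on a given line, which is why several lines are used.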

70 Link Analysis

71 The Web, circa 1998

72 The Web, circa 1998
Match keywords, language (information retrieval)
Explore directory

73 The Web, circa 1998
Match keywords, language (information retrieval): Easy to game with "term spam"
Explore directory: Time-consuming; not open-ended

74 Enter PageRank...

75 PageRank
Key Idea: Consider the citations of the website.

76 PageRank
Key Idea: Consider the citations of the website.
Who links to it? And what are their citations?

77 PageRank
Key Idea: Consider the citations of the website.
Who links to it? And what are their citations?
Innovation 1: What pages would a "random Web surfer" end up at?
Innovation 2: Not just own terms but what terms are used by citations?

78 PageRank
View 1: Flow Model: in-links as votes
Innovation 1: What pages would a "random Web surfer" end up at?
Innovation 2: Not just own terms but what terms are used by citations?

79 PageRank
View 1: Flow Model: in-links (citations) as votes, but citations from important pages should count more.
=> Use recursion to figure out if each page is important.
Innovation 1: What pages would a "random Web surfer" end up at?
Innovation 2: Not just own terms but what terms are used by citations?

80 PageRank
View 1: Flow Model:
How to compute? Each page (j) has an importance (i.e. rank, rj) (nj is |out-links|)

81 PageRank
View 1: Flow Model:
[Figure: a page whose rank r is the sum of the rank shares arriving over its in-links]
How to compute? Each page (j) has an importance (i.e. rank, rj) (nj is |out-links|)

83 PageRank
View 1: Flow Model: System of Equations
How to compute? Each page (j) has an importance (i.e. rank, rj) (nj is |out-links|)

85 PageRank
View 1: Flow Model: Solve
How to compute? Each page (j) has an importance (i.e. rank, rj) (nj is |out-links|)
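A common way to write the flow model, offered as a sketch in standard notation rather than as the slide's exact equation: with r_j the rank of page j and n_i the number of out-links of page i (as defined above), each page splits its rank evenly over its out-links, and the ranks are normalized to sum to 1 so that the system has a unique scale.

```latex
r_j = \sum_{i \to j} \frac{r_i}{n_i}, \qquad \sum_{j} r_j = 1
```

The sum runs over all pages i that link to j, so a link from a page with few out-links and a high rank contributes the most.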

86 PageRank
Transition Matrix, M (node labels A-D assumed):
to \ from   A     B     C     D
A           0     1/2   1     0
B           1/3   0     0     1/2
C           1/3   0     0     1/2
D           1/3   1/2   0     0

87 View 2: Matrix Formulation
Innovation: What pages would a "random Web surfer" end up at?
To start: N=4 nodes, so r = [¼, ¼, ¼, ¼]
[Transition Matrix, M, as on slide 86]

88 View 2: Matrix Formulation
Innovation: What pages would a "random Web surfer" end up at?
To start: N=4 nodes, so r = [¼, ¼, ¼, ¼]
after 1st iteration: M r = [3/8, 5/24, 5/24, 5/24]
after 2nd iteration: M(M r) = M^2 r = [15/48, 11/48, ...]
[Transition Matrix, M, as on slide 86]
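The two iterations can be reproduced exactly with rational arithmetic. The link structure below is an assumption: it is a 4-node graph (labels A-D are made up) whose transition matrix yields the iterates quoted above.

```python
from fractions import Fraction

# Assumed link structure consistent with the iterates on the slide.
OUT_LINKS = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
NODES = sorted(OUT_LINKS)


def transition_matrix(out_links, nodes):
    """M[to][from] = 1 / (number of out-links of 'from') when 'from' links to 'to'."""
    index = {node: i for i, node in enumerate(nodes)}
    M = [[Fraction(0)] * len(nodes) for _ in nodes]
    for src, targets in out_links.items():
        for dst in targets:
            M[index[dst]][index[src]] = Fraction(1, len(targets))
    return M


def step(M, r):
    """One multiplication r' = M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]


M = transition_matrix(OUT_LINKS, NODES)
r0 = [Fraction(1, 4)] * 4
r1 = step(M, r0)
r2 = step(M, r1)
print(r1)  # 3/8, 5/24, 5/24, 5/24
print(r2)  # 5/16 (= 15/48), 11/48, 11/48, 11/48
```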

89 View 2: Matrix Formulation
Innovation: What pages would a "random Web surfer" end up at?
To start: N=4 nodes, so r = [¼, ¼, ¼, ¼]
after 1st iteration: M r = [3/8, 5/24, 5/24, 5/24]
after 2nd iteration: M(M r) = M^2 r = [15/48, 11/48, ...]
Power iteration algorithm
initialize: t = 0, r[0] = [1/N, ..., 1/N], r[-1] = [0, ..., 0]
while (err_norm(r[t], r[t-1]) > min_err):
err_norm(v1, v2) = |v1 - v2|  # L norm
[Transition Matrix, M, as on slide 86]

90 View 2: Matrix Formulation
Innovation: What pages would a "random Web surfer" end up at?
To start: N=4 nodes, so r = [¼, ¼, ¼, ¼]
after 1st iteration: M r = [3/8, 5/24, 5/24, 5/24]
after 2nd iteration: M(M r) = M^2 r = [15/48, 11/48, ...]
Power iteration algorithm
initialize: t = 0, r[0] = [1/N, ..., 1/N], r[-1] = [0, ..., 0]
while (err_norm(r[t], r[t-1]) > min_err):
    r[t+1] = M r[t]
    t += 1
solution = r[t]
err_norm(v1, v2) = |v1 - v2|  # L norm
[Transition Matrix, M, as on slide 86]
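A runnable sketch of the power iteration loop follows. The 4 x 4 matrix is an assumed example consistent with the iterates quoted above, the error is measured with the L1 norm (an assumption, since the slide's norm order is not legible), and max_iter is an added safety cap that is not part of the slide's algorithm.

```python
def power_iteration(M, min_err=1e-10, max_iter=10_000):
    """Repeat r <- M r until successive vectors differ by at most min_err."""
    n = len(M)
    prev = [0.0] * n          # plays the role of r[-1]
    r = [1.0 / n] * n         # r[0]: uniform start
    iterations = 0
    while sum(abs(a - b) for a, b in zip(r, prev)) > min_err and iterations < max_iter:
        prev, r = r, [sum(m * x for m, x in zip(row, r)) for row in M]
        iterations += 1
    return r


# Assumed column-stochastic transition matrix (rows = "to", columns = "from").
M = [
    [0.0, 0.5, 1.0, 0.0],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.0, 0.0, 0.5],
    [1 / 3, 0.5, 0.0, 0.0],
]

r = power_iteration(M)
print([round(x, 4) for x in r])  # close to [0.3333, 0.2222, 0.2222, 0.2222]
```

For this matrix the fixed point is r = [3/9, 2/9, 2/9, 2/9], which can be confirmed by substituting it into r = M r.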

91 View 3: Eigenvectors
As err_norm gets smaller we are moving toward: r = M r
Power iteration algorithm
initialize: t = 0, r[0] = [1/N, ..., 1/N], r[-1] = [0, ..., 0]
while (err_norm(r[t], r[t-1]) > min_err):
    r[t+1] = M r[t]
    t += 1
solution = r[t]
err_norm(v1, v2) = |v1 - v2|  # L norm

92 View 3: Eigenvectors
As err_norm gets smaller we are moving toward: r = M r
We are actually just finding the eigenvector of M; power iteration finds it.
x is an eigenvector of A if: A x = λ x
Power iteration algorithm
initialize: t = 0, r[0] = [1/N, ..., 1/N], r[-1] = [0, ..., 0]
while (err_norm(r[t], r[t-1]) > min_err):
    r[t+1] = M r[t]
    t += 1
solution = r[t]
err_norm(v1, v2) = |v1 - v2|  # L norm

93 View 3: Eigenvectors
As err_norm gets smaller we are moving toward: r = M r
We are actually just finding the eigenvector of M; power iteration finds it.
x is an eigenvector of A if: A x = λ x
Here A = M and λ = 1, since columns of M sum to 1; thus, 1 r = M r.
Power iteration algorithm
initialize: t = 0, r[0] = [1/N, ..., 1/N], r[-1] = [0, ..., 0]
while (err_norm(r[t], r[t-1]) > min_err):
    r[t+1] = M r[t]
    t += 1
solution = r[t]
err_norm(v1, v2) = |v1 - v2|  # L norm

94 View 4: Markov Process
Where is the surfer at time t+1? p(t+1) = M p(t)
Suppose: p(t+1) = p(t); then p(t) is a stationary distribution of a random walk.
Thus, r is a stationary distribution: the probability of being at a given node.

95 View 4: Markov Process
Where is the surfer at time t+1? p(t+1) = M p(t)
Suppose: p(t+1) = p(t); then p(t) is a stationary distribution of a random walk.
Thus, r is a stationary distribution: the probability of being at a given node.
aka 1st order Markov Process
Rich probabilistic theory. One finding: the stationary distribution is unique if:
No "dead-ends": a node can't propagate its rank
No "spider traps": set of nodes with no way out.
Also known as being stochastic, irreducible, and aperiodic.

96 View 4: Markov Process - Problems for vanilla PI (power iteration)
[Transition matrix (to \ from); the entries that survive are 1/3]
What would r converge to?
aka 1st order Markov Process
Rich probabilistic theory. One finding: the stationary distribution is unique if:
No "dead-ends": a node can't propagate its rank
No "spider traps": set of nodes with no way out.
Also known as being stochastic, irreducible, and aperiodic.

98 View 4: Markov Process - Problems for vanilla PI (power iteration)
[Transition matrix (to \ from); the entries that survive are 1/3]
What would r converge to?
aka 1st order Markov Process
Rich probabilistic theory. One finding: the stationary distribution is unique if:
columns sum to 1 (stochastic)
same node doesn't repeat at regular intervals (aperiodic)
non-zero chance of going to any other node (irreducible)

99 The Google PageRank Formulation
Goals: No "dead-ends"; no "spider traps"
Add teleportation: At each step, two choices:
1. Follow a random link (probability β ≈ .85)
2. Teleport to a random node (probability 1 - β)

100 The Google PageRank Formulation
Goals: No "dead-ends"; no "spider traps"
Add teleportation: At each step, two choices:
1. Follow a random link (probability β ≈ .85)
2. Teleport to a random node (probability 1 - β)
[Transition matrix (to \ from), N = 4; the entries that survive are ⅓]

101 The Google PageRank Formulation
Goals: No "dead-ends"; no "spider traps"
Add teleportation: At each step, two choices:
1. Follow a random link (probability β ≈ .85)
2. Teleport to a random node (probability 1 - β)
[Transition matrix (to \ from) with the teleport term .15*¼ added to every entry]

102 The Google PageRank Formulation
Goals: No "dead-ends"; no "spider traps"
Add teleportation: At each step, two choices:
1. Follow a random link (probability β ≈ .85)
2. Teleport to a random node (probability 1 - β)
[Transition matrix (to \ from) with every entry m replaced by .85*m + .15*¼, e.g. .85*⅓ + .15*¼]

103 The Google PageRank Formulation
Goals: No "dead-ends"; no "spider traps"
Add teleportation: At each step, two choices:
1. Follow a random link (probability β ≈ .85)
2. Teleport to a random node (probability 1 - β)
[Transition matrix (to \ from), N = 4; the entries that survive are ⅓]

104 The Google PageRank Formulation
Goals: No "dead-ends"; no "spider traps"
Add teleportation: At each step, two choices:
1. Follow a random link (probability β ≈ .85)
2. Teleport to a random node (probability 1 - β)
[Transition matrix (to \ from) with the dead-end column filled with ¼ in every row]

105 The Google PageRank Formulation
Goals: No "dead-ends"; no "spider traps"
Add teleportation: At each step, two choices:
1. Follow a random link (probability β ≈ .85)
2. Teleport to a random node (probability 1 - β)
[Transition matrix (to \ from) with the dead-end column entries written as .85*¼ + .15*¼]

106 The Google PageRank Formulation
Goals: No "dead-ends"; no "spider traps"
Add teleportation: At each step, two choices:
1. Follow a random link (probability β ≈ .85)
2. Teleport to a random node (probability 1 - β)
(Teleport from a dead-end has probability 1)
[Transition matrix (to \ from): dead-end column entries are 1*¼; every other entry m is .85*m + .15*¼]

107 Teleportation, as Flow Model:
Goals: No "dead-ends"; no "spider traps"
(Brin and Page, 1998)
[Transition matrix (to \ from): dead-end column entries are 1*¼; every other entry m is .85*m + .15*¼]

108 Teleportation, as Flow Model:
Goals: No "dead-ends"; no "spider traps"
(Brin and Page, 1998)
Teleportation, as Matrix Model:
[Transition matrix (to \ from): dead-end column entries are 1*¼; every other entry m is .85*m + .15*¼]

109 Teleportation, as Flow Model:
Goals: No "dead-ends"; no "spider traps"
(Brin and Page, 1998)
Teleportation, as Matrix Model:
[Transition matrix (to \ from): dead-end column entries are .85*¼ + .15*¼; every other entry m is .85*m + .15*¼]

110 Teleportation, as Flow Model:
Goals: No "dead-ends"; no "spider traps"
(Brin and Page, 1998)
Teleportation, as Matrix Model:
To apply: run power iterations over M' instead of M.
[Transition matrix (to \ from): dead-end column entries are 1*¼; every other entry m is .85*m + .15*¼]

111 Teleportation, as Flow Model:
Goals: No "dead-ends"; no "spider traps"
(Brin and Page, 1998)
Teleportation, as Matrix Model:
Steps:
1. Compute M
2. Add 1/N to all dead-ends.
3. Convert M to M'
4. Run Power Iterations.
[Transition matrix (to \ from): dead-end column entries are 1*¼; every other entry m is .85*m + .15*¼]
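The four steps can be sketched end to end as below. This is an illustration under stated assumptions: the teleporting matrix is taken to be M' = β M + (1 - β)/N applied entrywise, the 3-node graph (with C as a dead-end) is made up, the error uses the L1 norm, and max_iter is an added safety cap.

```python
def transition_matrix(out_links, nodes):
    """Steps 1 and 2: build M, giving every dead-end column the value 1/N."""
    n = len(nodes)
    index = {node: i for i, node in enumerate(nodes)}
    M = [[0.0] * n for _ in range(n)]
    for src in nodes:
        targets = out_links.get(src, [])
        col = index[src]
        if not targets:  # dead-end: teleport with probability 1
            for row in range(n):
                M[row][col] = 1.0 / n
        else:
            for dst in targets:
                M[index[dst]][col] += 1.0 / len(targets)
    return M


def add_teleportation(M, beta=0.85):
    """Step 3: M' = beta * M + (1 - beta) / N in every entry."""
    n = len(M)
    return [[beta * m + (1.0 - beta) / n for m in row] for row in M]


def power_iteration(A, min_err=1e-10, max_iter=10_000):
    """Step 4: iterate r <- A r until the L1 change is at most min_err."""
    n = len(A)
    prev, r = [0.0] * n, [1.0 / n] * n
    iterations = 0
    while sum(abs(a - b) for a, b in zip(r, prev)) > min_err and iterations < max_iter:
        prev, r = r, [sum(a * x for a, x in zip(row, r)) for row in A]
        iterations += 1
    return r


# Hypothetical graph: C has no out-links (a dead-end).
NODES = ["A", "B", "C"]
OUT_LINKS = {"A": ["B", "C"], "B": ["C"], "C": []}

M_prime = add_teleportation(transition_matrix(OUT_LINKS, NODES))
ranks = power_iteration(M_prime)
print(dict(zip(NODES, (round(x, 4) for x in ranks))))
```

Every column of M' still sums to 1 and every entry is positive, so the random surfer can always reach any node and the iteration settles on a single rank vector.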
