Information Retrieval: Dictionaries & Tolerant Retrieval
Jörg Tiedemann, Department of Linguistics and Philology, Uppsala University

Recap from previous lectures
- Inverted indexes: dictionary & postings; type/token distinction; terms = normalized types put in the dictionary
- Boolean Model: return exact matches for Boolean queries
- Vector Space Model: term weights (TF-IDF), ranked retrieval, dimensionality reduction (LSA/R)
- Link Analysis: mining in linked data sets (networks); PageRank, Hubs, Authorities
- Web Crawling

Topics for Today
Going back to the basics (inverted indexes):
1. Skip pointers
2. Phrase queries
3. Dictionary data structures
4. Tolerant retrieval: wild-card queries, spelling correction, Soundex
Recall: Basic structure of an Inverted index
For each term t, we store a list of all documents that contain t. The dictionary entries (BRUTUS, CAESAR, CALPURNIA) each point to a postings list.

Faster intersection with skip pointers
- skip sections of postings with short cuts
- more efficient with long posting lists
- How do we use the skip pointers?
- Where do we put skip pointers?

Where do we place skips?
Tradeoff: number of items skipped vs. frequency of skips
- More skips: each skip pointer skips only a few items, but we can use it frequently.
- Fewer skips: each skip pointer skips many items, but we cannot use it very often.
- Simple heuristic: for a postings list of length P, use √P evenly-spaced skip pointers (ignoring term distribution).
- Easy for a static index, harder for dynamic ones!
- The gain is not much with today's fast CPUs!

Phrase queries
Query: "stanford university" as a phrase (10% of web queries = phrase queries)
Idea one: Biword index!
- index every word bigram
- longer phrases, e.g. "stanford university palo alto":
  1. STANFORD UNIVERSITY AND UNIVERSITY PALO AND PALO ALTO
  2. post-filtering of all hits
- Problems?
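The skip-pointer heuristic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: postings are plain sorted lists of document IDs, a skip pointer is assumed to sit at every position that is a multiple of ⌊√P⌋, and the function name `intersect_with_skips` is made up for this example.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using sqrt(P) evenly spaced skips.

    Assumption: a skip pointer exists at every index that is a multiple
    of floor(sqrt(len(list))) and points that many entries ahead.
    """
    s1 = max(1, int(math.sqrt(len(p1))))
    s2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if its target does not overshoot p2[j].
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))
```

A skip is taken only when the entry it points to is still less than or equal to the current entry of the other list, so no match can be jumped over.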
Phrase queries
Idea two: Positional indexes! (store positions in the postings list)
Example query: to(1) be(2) or(3) not(4) to(5) be(6)
TO: ⟨ 1: ⟨7, 18, 33, 72, 86, 231⟩; 2: ⟨1, 17, 74, 222, 255⟩; 4: ⟨8, 16, 190, 429, 433⟩; 5: ⟨363, 367⟩; 7: ⟨13, 23, 191⟩; ... ⟩
BE: ⟨ 1: ⟨17, 25⟩; 4: ⟨17, 191, 291, 430, 434⟩; 5: ⟨14, 19, 101⟩; ... ⟩
Document 4 is a match!

Proximity search
Second advantage of positional indexes: we can also use them for proximity search.
For example: employment /4 place
Find all documents that contain EMPLOYMENT and PLACE within 4 words of each other.

Data Structures for Dictionaries
For each term t, we store a list of all documents that contain t. The dictionary entries (BRUTUS, CAESAR, CALPURNIA) each point to a postings list.
How do we store a dictionary in memory efficiently?

Naive Dictionary: array of fixed-width entries

  term       document frequency   pointer to postings list
  a          656,265              →
  aachen     65                   →
  zulu       221                  →
  space needed:
  20 bytes   4 bytes              4 bytes

How do we look up an element in this array at query time?
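The positional-index example above can be checked with a small sketch. It assumes the index is a nested dictionary `term -> {doc_id: [positions]}` holding only the postings that survive in the slide, and the helper name `positional_match` is hypothetical. The nested loops keep the code short; a real implementation would merge the two sorted position lists linearly.

```python
def positional_match(index, term1, term2, k=1, ordered=True):
    """Find documents where term2 occurs within k positions of term1.

    ordered=True, k=1 gives an exact two-word phrase match;
    ordered=False gives a proximity query in the style of "term1 /k term2".
    """
    hits = {}
    postings1 = index.get(term1, {})
    postings2 = index.get(term2, {})
    for doc_id in sorted(postings1.keys() & postings2.keys()):
        pairs = []
        for p1 in postings1[doc_id]:
            for p2 in postings2[doc_id]:
                gap = p2 - p1
                if ordered and 0 < gap <= k:
                    pairs.append((p1, p2))
                elif not ordered and 0 < abs(gap) <= k:
                    pairs.append((p1, p2))
        if pairs:
            hits[doc_id] = pairs
    return hits

# Postings copied from the slide (the elided "..." parts are left out).
index = {
    "to": {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
           4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]},
    "be": {1: [17, 25], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(positional_match(index, "to", "be"))
```

Only document 4 contains "to" immediately followed by "be", which agrees with the slide. A full match for "to be or not to be" would repeat the same check for each adjacent word pair and keep the positions that line up.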
Dictionary Data structures
Two main classes of data structures: hashes and trees
Criteria for when to use hashes vs. trees:
- Is there a fixed number of terms or will it keep growing?
- What are the relative frequencies with which various keys will be accessed?
- How many terms are we likely to have?

Hashes
- Each vocabulary term is hashed into an integer.
- Try to avoid collisions.
- At query time: hash the query term, resolve collisions, locate the entry in the fixed-width array.
Pros:
- Lookup is fast (faster than in a search tree).
- Lookup time is constant.
Cons:
- no way to find minor variants (resume vs. résumé)
- no prefix search (all terms starting with automat)
- need to rehash everything periodically if the vocabulary keeps growing

Trees: Binary tree
- simplest tree structure
- efficient for searching
Pros:
- solves the prefix problem (terms starting with automat)
Cons:
- slower: O(log M) in balanced trees (M is the size of the vocabulary)
- re-balancing binary trees is expensive!
Alternative: B-tree
Trees: B-tree
What's the difference?
B-tree definition: every internal node has a number of children in the interval [a, b], where a, b are appropriate positive integers, e.g., [2, 4].
- slightly more complex
- still efficient for searching
- same features as binary trees (prefix search!)
- need for re-balancing is less frequent!

Tolerant Retrieval
- Wildcard queries
- Spelling correction
- Soundex

Tolerant Retrieval: Wildcard queries
Prefix queries: mon*: find all docs containing any term beginning with mon
- Easy with a B-tree dictionary: retrieve all terms t in the range mon ≤ t < moo
Suffix queries: *mon: find all docs containing any term ending with mon
- Maintain an additional tree for terms written backwards
- Then retrieve all terms t in the range nom ≤ t < non
How can we find all terms matching pro*cent?
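The two range lookups above (mon ≤ t < moo and nom ≤ t < non) can be sketched with a sorted list and binary search. The sorted list is only a stand-in for the B-tree, the small vocabulary is invented, and the function names are hypothetical. The sketch assumes a non-empty prefix or suffix.

```python
import bisect

def prefix_range(sorted_terms, prefix):
    """Return all terms t with prefix <= t < upper, e.g. mon <= t < moo."""
    # Upper bound: bump the last character of the prefix ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

def suffix_range(sorted_reversed_terms, suffix):
    """Answer *suffix by a prefix lookup over the reversed terms."""
    matches = prefix_range(sorted_reversed_terms, suffix[::-1])
    return sorted(t[::-1] for t in matches)

vocabulary = ["lemon", "monday", "money", "month", "moon", "salmon", "sermon"]
forward = sorted(vocabulary)
backward = sorted(t[::-1] for t in vocabulary)

print(prefix_range(forward, "mon"))   # terms for mon*
print(suffix_range(backward, "mon"))  # terms for *mon
```

Both lookups cost two binary searches plus the size of the answer, which is the property that makes sorted structures suitable for prefix queries and hashes unsuitable.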
How to handle * in the middle of a term
Example: pro*cent
- We could look up pro* and *cent in the B-tree and intersect the two term sets.
- Expensive!
- Alternative: permuterm index!

Permuterm index
- Rotate the wildcard query so that the * occurs at the end.
- Introduce a special symbol $ to indicate the end of the string.
- pro*cent becomes cent$pro*
- How does this help when matching queries with the index?

Permuterm index & term mapping
- Add all rotations to the B-tree: HELLO → hello$, ello$h, llo$he, lo$hel, o$hell, $hello
- All rotated entries map to the same string.

Permuterm index: lookups
- For X, look up X$
- For X*, look up $X*
- For *X, look up X$*
- For *X*, look up X*
- For X*Y, look up Y$X*
Example: for hel*o, look up o$hel*
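The rotation scheme above can be sketched as follows. A plain dictionary scanned linearly stands in for the B-tree prefix lookup, the vocabulary is invented, and all function names are hypothetical. Only the query shapes listed on the slide (X, X*, *X, *X*, X*Y) are handled.

```python
def rotations(term):
    """All rotations of term + '$' (the end-of-string marker)."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocabulary):
    """Map every rotation back to the term it came from."""
    return {rot: term for term in vocabulary for rot in rotations(term)}

def lookup_key(query):
    """Rotate the query so the wildcard ends up last.

    Returns (key, is_prefix): look up `key` exactly, or as a prefix.
    """
    stars = query.count("*")
    if stars == 0:
        return query + "$", False            # X    -> X$
    if stars == 1:
        before, after = query.split("*")
        return after + "$" + before, True    # X*Y  -> Y$X*
    if stars == 2 and query.startswith("*") and query.endswith("*"):
        return query[1:-1], True             # *X*  -> X*
    raise ValueError("unsupported wildcard pattern: " + query)

def permuterm_lookup(index, query):
    key, is_prefix = lookup_key(query)
    if not is_prefix:
        return [index[key]] if key in index else []
    # Linear scan; a B-tree would answer this prefix query as a range.
    return sorted({t for rot, t in index.items() if rot.startswith(key)})

index = build_permuterm(["hello", "help", "halo", "procent", "percent"])
print(lookup_key("hel*o"))
print(permuterm_lookup(index, "pro*cent"))
```

The one-star branch covers X*, *X and X*Y with the same expression, because an empty X or Y simply leaves `$` at the start or end of the key.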
Processing a lookup in the permuterm index
To sum up:
- Rotate the query wildcard to the right.
- Use B-tree lookup as before.
What is the problem with this structure?
- Permuterm more than quadruples the size of the dictionary compared to a regular B-tree (empirical observation for English).

Alternative: Bigram (k-gram) indexes
- Enumerate all k-grams (sequences of k characters) occurring in a term.
- More space-efficient than the permuterm index.
- Example: from "April is the cruelest month" we get the 2-grams (bigrams):
  $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$
- $ is a special word boundary symbol, as before.
- Maintain a second inverted index from bigrams to the dictionary terms that contain the bigram.

Postings list in a 3-gram inverted index
etr → BEETROOT → METRIC → PETRIFY → RETRIEVAL

Processing wildcarded terms in a bigram index
- Retrieve all postings of matching k-grams.
- Intersect all lists as usual.
- Query mon* can now be run as: $m AND mo AND on
- This gets us all terms with the prefix mon, but also many false positives like MOON.
- → We must post-filter these terms against the query.
- Surviving terms are then looked up in the term-document inverted index.
- → k-gram indexes are still fast and more space-efficient than permuterm indexes.
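The bigram lookup and the post-filtering step above can be sketched like this. It is an illustration only: the vocabulary is made up, the function names are hypothetical, and the post-filter uses a regular expression built from the wildcard query, which is one possible way to do the filtering rather than the method of the lecture.

```python
import re
from collections import defaultdict

def kgrams(term, k=2):
    """k-grams of a term padded with the boundary symbol $."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_kgram_index(vocabulary, k=2):
    """Second inverted index: k-gram -> terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def query_kgrams(query, k=2):
    """k-grams of the fixed pieces of a wildcard query (mon* -> $m, mo, on)."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):
        grams.update(piece[i:i + k] for i in range(len(piece) - k + 1))
    return grams

def candidate_terms(index, query, k=2):
    """AND together the postings of all query k-grams (may contain false positives)."""
    postings = [index.get(g, set()) for g in query_kgrams(query, k)]
    if not postings:
        return set().union(*index.values())
    return set.intersection(*postings)

def wildcard_lookup(index, query, k=2):
    """Post-filter the candidates against the original wildcard query."""
    pattern = ".*".join(re.escape(p) for p in query.split("*"))
    return sorted(t for t in candidate_terms(index, query, k)
                  if re.fullmatch(pattern, t))

index = build_kgram_index(["moon", "month", "monday", "lemon", "mormon"])
print(sorted(candidate_terms(index, "mon*")))  # includes false positives
print(wildcard_lookup(index, "mon*"))
```

In this toy vocabulary both "moon" and "mormon" contain $m, mo and on, so they survive the intersection and are only removed by the post-filter.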
Processing wildcard queries
- As before, we must potentially execute a large number of Boolean queries.
- Most straightforward semantics: conjunction of disjunctions
- Recall the query: gen* AND universit*
  (geneva AND university) OR (geneva AND université) OR (genève AND university) OR (genève AND université) OR (general AND universities) OR ...
- → Very expensive!
- → Requires query optimization!
Do we need to support wildcard queries?
- Users are lazy!
- If wildcards are allowed → users will love it (?)
- Does Google allow wildcard queries? (And other engines?)

Tolerant Retrieval: Spälling Korrection
Two general uses of spelling correction:
- Correcting documents being indexed
- Correcting user queries (more common)
Two different methods:
- Isolated word spelling correction
  - Check each word on its own for misspelling.
  - Will not catch typos resulting in correctly spelled words, e.g., "an asteroid that fell form the sky".
- Context-sensitive spelling correction
  - Look at surrounding words.
  - Can correct the form/from error above.

Correcting documents
- primarily for OCR'ed documents
- tuned for OCR mistakes (trained classifiers)
- may use domain-specific knowledge (confusion between O and D etc.)
- but also: correct typos in web pages and low-quality documents
- → fewer misspellings in our dictionary and better matching
- general IR philosophy: don't change the documents

Correcting queries
- more common in IR than document correction
- typos in queries are common
  - people are in a hurry
  - users often look for things they don't know much about
  - example: al quajda
- strategies:
  - (also) retrieve documents with the correct spelling
  - return alternative query suggestions ("Did you mean ...?")
Isolated word correction
- Premise 1: There is a list of correct words from which the correct spellings come.
- Premise 2: We have a way of computing the distance between a misspelled word and a correct word.
- Simple spelling correction algorithm: return the correct word that has the smallest distance to the misspelled word.
- Example: informaton → information

Isolated word correction: the list of correct words
Two choices:
- use a standard lexicon
  - Webster's, OED etc.
  - industry-specific dictionary (for domain-specific IR)
  - advantage: correct entries only!
- vocabulary of the inverted index
  - all words in the collection → better coverage
  - but: includes all misspellings → compute weights for all terms (based on frequencies)

Isolated word correction: what is closest?
Task: Return the lexicon entry that is closest to a given character sequence Q.
What is closest? Several alternatives:
1. Edit distance and Levenshtein distance
2. Weighted edit distance
3. k-gram overlap

Edit distance
The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2.
Levenshtein distance: basic operations = insert, delete, and replace
- Levenshtein distance dog-do: 1
- Levenshtein distance cat-cart: 1
- Levenshtein distance cat-cut: 1
- Levenshtein distance cat-act: 2
Damerau-Levenshtein: additional operation = transpose
- Damerau-Levenshtein distance cat-act: 1
Levenshtein distance: Computation
Recursive definition & dynamic programming:
- start with the upper-left table cell
- fill the table with edit costs
(Table for the strings "fast" and "cats": columns f a s t, rows c a t s.)

Levenshtein distance: algorithm

  LEVENSHTEINDISTANCE(s1, s2)
    for i ← 0 to |s1|
    do m[i, 0] = i
    for j ← 0 to |s2|
    do m[0, j] = j
    for i ← 1 to |s1|
    do for j ← 1 to |s2|
       do if s1[i] = s2[j]
          then m[i, j] = min{m[i−1, j] + 1, m[i, j−1] + 1, m[i−1, j−1]}
          else m[i, j] = min{m[i−1, j] + 1, m[i, j−1] + 1, m[i−1, j−1] + 1}
    return m[|s1|, |s2|]

Operations: insert, delete, replace, copy
Each cell of Levenshtein matrix
Each cell holds four values:
- cost of getting here from my upper left neighbor (copy or replace)
- cost of getting here from my upper neighbor (delete)
- cost of getting here from my left neighbor (insert)
- the minimum of the three possible moves; the cheapest way of getting here

Levenshtein distance: Example
(Table for the strings "fast" and "cats": columns f a s t, rows c a t s.)
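The dynamic-programming recurrence from the slides translates almost line by line into Python. The sketch below is not the lecture's code: the optional `transpose` flag is an addition that implements the restricted "optimal string alignment" variant of Damerau-Levenshtein, chosen here only because it reproduces the cat-act example with a few extra lines.

```python
def levenshtein(s1, s2, transpose=False):
    """Edit distance with insert, delete and replace (cost 1 each).

    transpose=True additionally allows swapping two adjacent characters
    (restricted Damerau-Levenshtein, an assumption of this sketch).
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i                      # delete all of s1[:i]
    for j in range(cols):
        m[0][j] = j                      # insert all of s2[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1   # copy vs. replace
            m[i][j] = min(m[i - 1][j] + 1,              # delete
                          m[i][j - 1] + 1,              # insert
                          m[i - 1][j - 1] + cost)       # copy or replace
            if (transpose and i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]):
                m[i][j] = min(m[i][j], m[i - 2][j - 2] + 1)  # transpose
    return m[len(s1)][len(s2)]

for a, b in [("dog", "do"), ("cat", "cart"), ("cat", "cut"), ("cat", "act")]:
    print(a, b, levenshtein(a, b))
print("cat", "act", levenshtein("cat", "act", transpose=True))
```

The loop indices are shifted by one relative to the pseudocode because Python strings are 0-indexed while the table reserves row 0 and column 0 for the empty prefix.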
Weighted edit distance
- As above, but the weight of an operation depends on the characters involved.
- Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q.
- Therefore, replacing m by n is a smaller edit distance than replacing it by q.
- We now require a weight matrix as input.
- Modify the dynamic programming to handle weights.

Using edit distance
Given a query:
- get all sequences within a fixed edit distance
- intersect this list with the list of correct words
- return spelling suggestions
Alternatively:
- use all corrections to retrieve documents → slow! (and accepted by the user?)
- use the single best correction for retrieval

Using edit distance: problems
- lots of possible strings even with few edit operations
- intersection with the dictionary is slow
Possible solution: use n-gram overlap
- can replace edit distance for spelling correction

k-gram indexes for spelling correction
- Get all k-grams in the query term.
- Use the k-gram index to retrieve correct words that match the query term's k-grams (recall wildcard search).
- Threshold by the number of matching k-grams (e.g., only terms that differ by at most 3 k-grams),
- or use Jaccard coefficient > threshold:
  Jaccard(A, B) = |A ∩ B| / |A ∪ B|
Example: bigram index, misspelled word bordroom
Bigrams: bo, or, rd, dr, ro, oo, om
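The Jaccard threshold above can be sketched for the bordroom example. Following the slide, bigrams are taken without the $ boundary symbol. The small vocabulary, the threshold of 0.5 and the function names are assumptions made for this illustration.

```python
def bigrams(term):
    """Set of character bigrams, without boundary symbols (as on the slide)."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def suggest(misspelled, vocabulary, threshold=0.5):
    """Vocabulary terms whose bigram overlap with the query exceeds the threshold."""
    query = bigrams(misspelled)
    scored = [(jaccard(query, bigrams(term)), term) for term in vocabulary]
    return [term for score, term in sorted(scored, reverse=True) if score > threshold]

vocabulary = ["aboard", "about", "ardent", "boardroom", "border",
              "lord", "morbid", "sordid"]
print(sorted(bigrams("bordroom")))
print(suggest("bordroom", vocabulary))
```

A real system would not score the whole vocabulary; it would first use the k-gram index to fetch only the terms sharing at least one bigram with the query and score those.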
k-gram indexes for spelling correction: bordroom
BO → aboard → about → boardroom → border
OR → border → lord → morbid → sordid
RD → aboard → ardent → boardroom → border
...
→ boardroom exists in 6 out of 7 lists
→ Jaccard = 6/(7 + 8 − 6) = 6/9

Context-sensitive spelling correction
One approach:
- break the phrase query into a conjunction of biwords
- look for biwords that need only one term corrected
- get phrase matches and rank them
- ...

Context-sensitive spelling correction
Another approach: hit-based spelling correction
Example: flew form munich
Try all phrases with possible corrections:
- Try query "flea form munich"
- Try query "flew from munich"
- Try query "flew form munch"
The correct query "flew from munich" has the most hits.
Many alternatives → not very efficient!
- try to correct only if few hits are returned
- tweaking with query logs and expected hits

Tolerant Search: Soundex
Find phonetic alternatives. Example: chebyshev / tchebyscheff
Soundex = class of heuristics (invented in 1918)
1. Retain the first letter of the term.
2. Replace all letters [AEIOUHWY] with 0.
3. Replace letters with digits:
   B, F, P, V to 1
   C, G, J, K, Q, S, X, Z to 2
   D, T to 3
   L to 4
   M, N to 5
   R to 6
4. Remove consecutive identical digits.
5. Remove all zeros and return the first 4 characters.
HERMAN → H655
Will HERMANN generate the same code?
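The five Soundex steps can be written down directly. Two details are assumptions of this sketch and not stated on the slide: the digit coding is applied only to the letters after the retained first letter, and short codes are padded with trailing zeros to four characters. Characters outside A-Z are dropped.

```python
# Letter-to-digit table from steps 2 and 3.
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        CODES[letter] = digit

def soundex(term):
    """Four-character Soundex code following the steps on the slide."""
    letters = [c for c in term.upper() if c in CODES]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]      # step 1: keep the first letter
    digits = [CODES[c] for c in rest]          # steps 2 and 3
    collapsed = []
    for d in digits:                           # step 4: drop repeated digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = first + "".join(d for d in collapsed if d != "0")  # step 5
    return (code + "000")[:4]                  # pad (assumption) and truncate

print(soundex("HERMAN"), soundex("HERMANN"))
```

Both spellings give H655, which answers the question on the slide: the doubled N collapses in step 4 before the zeros are removed.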
To sum up
Using
- a positional inverted index with skip pointers
- efficient dictionary storage
- a wild-card index
- spelling correction
- Soundex
we can quickly run a query like
(SPELL(moriset) /3 tor*to) OR SOUNDEX(chaikofski)

Resources
- Chapter 3 of IIR
- Soundex demo
- Levenshtein distance demo
- Levenshtein distance slides
- Peter Norvig's spelling corrector

Next time
- Ranked Retrieval: efficient scoring & retrieval
- Putting it all together in a search system
- lab on boolean and ranked retrieval
Later:
- text extraction
- automatic summarization
More informationMesh density options. Rigidity mode options. Transform expansion. Pin depth options. Set pin rotation. Remove all pins button.
Martin Evening Adobe Photoshop CS5 for Photographers Including soft edges The Puppet Warp mesh is mostly applied to all of the selected layer contents, including the semi-transparent edges, even if only
More informationSoutheastern European Regional Programming Contest Bucharest, Romania Vinnytsya, Ukraine October 21, Problem A Concerts
Problem A Concerts File: A.in File: standard output Time Limit: 0.3 seconds (C/C++) Memory Limit: 128 megabytes John enjoys listening to several bands, which we shall denote using A through Z. He wants
More informationFall 2015 COMP Operating Systems. Lab #7
Fall 2015 COMP 3511 Operating Systems Lab #7 Outline Review and examples on virtual memory Motivation of Virtual Memory Demand Paging Page Replacement Q. 1 What is required to support dynamic memory allocation
More informationVoice Activity Detection
Voice Activity Detection Speech Processing Tom Bäckström Aalto University October 2015 Introduction Voice activity detection (VAD) (or speech activity detection, or speech detection) refers to a class
More informationConfiguring OSPF. Information About OSPF CHAPTER
CHAPTER 22 This chapter describes how to configure the ASASM to route data, perform authentication, and redistribute routing information using the Open Shortest Path First (OSPF) routing protocol. The
More informationGraphs and Network Flows IE411. Lecture 14. Dr. Ted Ralphs
Graphs and Network Flows IE411 Lecture 14 Dr. Ted Ralphs IE411 Lecture 14 1 Review: Labeling Algorithm Pros Guaranteed to solve any max flow problem with integral arc capacities Provides constructive tool
More informationAI Approaches to Ultimate Tic-Tac-Toe
AI Approaches to Ultimate Tic-Tac-Toe Eytan Lifshitz CS Department Hebrew University of Jerusalem, Israel David Tsurel CS Department Hebrew University of Jerusalem, Israel I. INTRODUCTION This report is
More informationReal Time Word to Picture Translation for Chinese Restaurant Menus
Real Time Word to Picture Translation for Chinese Restaurant Menus Michelle Jin, Ling Xiao Wang, Boyang Zhang Email: mzjin12, lx2wang, boyangz @stanford.edu EE268 Project Report, Spring 2014 Abstract--We
More informationIntroduction to Markov Models
Introduction to Markov Models But first: A few preliminaries Estimating the probability of phrases of words, sentences, etc. CIS 391 - Intro to AI 2 What counts as a word? A tricky question. How to find
More informationLecture 4: n-grams in NLP. LING 1330/2330: Introduction to Computational Linguistics Na-Rae Han
Lecture 4: n-grams in NLP LING 1330/2330: Introduction to Computational Linguistics Na-Rae Han Objectives Frequent n-grams in English n-grams and statistical NLP n-grams and conditional probability Large
More informationMethod for Real Time Text Extraction of Digital Manga Comic
Method for Real Time Text Extraction of Digital Manga Comic Kohei Arai Information Science Department Saga University Saga, 840-0027, Japan Herman Tolle Software Engineering Department Brawijaya University
More informationgame tree complete all possible moves
Game Trees Game Tree A game tree is a tree the nodes of which are positions in a game and edges are moves. The complete game tree for a game is the game tree starting at the initial position and containing
More informationClassification of Road Images for Lane Detection
Classification of Road Images for Lane Detection Mingyu Kim minkyu89@stanford.edu Insun Jang insunj@stanford.edu Eunmo Yang eyang89@stanford.edu 1. Introduction In the research on autonomous car, it is
More informationImage Searches, Abstraction, Invariance : Data Mining 8 September 2008
Image Searches, Abstraction, Invariance 36-350: Data Mining 8 September 2008 1 Medical: x-rays, brain imaging, histology ( do these look like cancerous cells? ) Satellite imagery Fingerprints Finding illustrations
More information6.02 Introduction to EECS II Spring Quiz 1
M A S S A C H U S E T T S I N S T I T U T E O F T E C H N O L O G Y DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE 6.02 Introduction to EECS II Spring 2011 Quiz 1 Name SOLUTIONS Score Please
More informationG51PGP: Software Paradigms. Object Oriented Coursework 4
G51PGP: Software Paradigms Object Oriented Coursework 4 You must complete this coursework on your own, rather than working with anybody else. To complete the coursework you must create a working two-player
More informationDistinguishing Photographs and Graphics on the World Wide Web
Distinguishing Photographs and Graphics on the World Wide Web Vassilis Athitsos, Michael J. Swain and Charles Frankel Department of Computer Science The University of Chicago Chicago, Illinois 60637 vassilis,
More informationIntroduction to Source Coding
Comm. 52: Communication Theory Lecture 7 Introduction to Source Coding - Requirements of source codes - Huffman Code Length Fixed Length Variable Length Source Code Properties Uniquely Decodable allow
More informationHeuristic Search with Pre-Computed Databases
Heuristic Search with Pre-Computed Databases Tsan-sheng Hsu tshsu@iis.sinica.edu.tw http://www.iis.sinica.edu.tw/~tshsu 1 Abstract Use pre-computed partial results to improve the efficiency of heuristic
More informationThe Message Passing Interface (MPI)
The Message Passing Interface (MPI) MPI is a message passing library standard which can be used in conjunction with conventional programming languages such as C, C++ or Fortran. MPI is based on the point-to-point
More informationDeep Green. System for real-time tracking and playing the board game Reversi. Final Project Submitted by: Nadav Erell
Deep Green System for real-time tracking and playing the board game Reversi Final Project Submitted by: Nadav Erell Introduction to Computational and Biological Vision Department of Computer Science, Ben-Gurion
More informationHuffman Coding - A Greedy Algorithm. Slides based on Kevin Wayne / Pearson-Addison Wesley
- A Greedy Algorithm Slides based on Kevin Wayne / Pearson-Addison Wesley Greedy Algorithms Greedy Algorithms Build up solutions in small steps Make local decisions Previous decisions are never reconsidered
More informationIntroduction. Physics 1CL WAVES AND SOUND FALL 2009
Introduction This lab and the next are based on the physics of waves and sound. In this lab, transverse waves on a string and both transverse and longitudinal waves on a slinky are studied. To describe
More informationLink State Routing. Stefano Vissicchio UCL Computer Science CS 3035/GZ01
Link State Routing Stefano Vissicchio UCL Computer Science CS 335/GZ Reminder: Intra-domain Routing Problem Shortest paths problem: What path between two vertices offers minimal sum of edge weights? Classic
More informationConvolutional neural networks
Convolutional neural networks Themes Curriculum: Ch 9.1, 9.2 and http://cs231n.github.io/convolutionalnetworks/ The simple motivation and idea How it s done Receptive field Pooling Dilated convolutions
More informationDesign and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute. Module 6 Lecture - 37 Divide and Conquer: Counting Inversions
Design and Analysis of Algorithms Prof. Madhavan Mukund Chennai Mathematical Institute Module 6 Lecture - 37 Divide and Conquer: Counting Inversions Let us go back and look at Divide and Conquer again.
More informationMITOCW R3. Document Distance, Insertion and Merge Sort
MITOCW R3. Document Distance, Insertion and Merge Sort The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational
More information신경망기반자동번역기술. Konkuk University Computational Intelligence Lab. 김강일
신경망기반자동번역기술 Konkuk University Computational Intelligence Lab. http://ci.konkuk.ac.kr kikim01@kunkuk.ac.kr 김강일 Index Issues in AI and Deep Learning Overview of Machine Translation Advanced Techniques in
More informationPast questions from the last 6 years of exams for programming 101 with answers.
1 Past questions from the last 6 years of exams for programming 101 with answers. 1. Describe bubble sort algorithm. How does it detect when the sequence is sorted and no further work is required? Bubble
More informationLeaf-Value Tables for Pruning Non-Zero-Sum Games
Leaf-Value Tables for Pruning Non-Zero-Sum Games Nathan Sturtevant University of Alberta Department of Computing Science Edmonton, AB Canada T6G 2E8 nathanst@cs.ualberta.ca Abstract Algorithms for pruning
More informationCS 7643: Deep Learning
CS 7643: Deep Learning Topics: Toeplitz matrices and convolutions = matrix-mult Dilated/a-trous convolutions Backprop in conv layers Transposed convolutions Dhruv Batra Georgia Tech HW1 extension 09/22
More informationMITOCW ocw lec11
MITOCW ocw-6.046-lec11 Here 2. Good morning. Today we're going to talk about augmenting data structures. That one is 23 and that is 23. And I look here. For this one, And this is a -- Normally, rather
More informationQuestion Score Max Cover Total 149
CS170 Final Examination 16 May 20 NAME (1 pt): TA (1 pt): Name of Neighbor to your left (1 pt): Name of Neighbor to your right (1 pt): This is a closed book, closed calculator, closed computer, closed
More informationComputing for Engineers in Python
Computing for Engineers in Python Lecture 10: Signal (Image) Processing Autumn 2011-12 Some slides incorporated from Benny Chor s course 1 Lecture 9: Highlights Sorting, searching and time complexity Preprocessing
More informationCentral Place Indexing: Optimal Location Representation for Digital Earth. Kevin M. Sahr Department of Computer Science Southern Oregon University
Central Place Indexing: Optimal Location Representation for Digital Earth Kevin M. Sahr Department of Computer Science Southern Oregon University 1 Kevin Sahr - October 6, 2014 The Situation Geospatial
More informationExploring variations through computational analysis. Alan Marsden, Lancaster University, UK
Exploring variations through computational analysis Alan Marsden, Lancaster University, UK Possibilities of using computation Using computers changes or even challenges the practices of music analysis.
More informationDeriving Consistency from LEGOs
Deriving Consistency from LEGOs What we have learned in 6 years of FLL and 7 years of Lego Robotics by Austin and Travis Schuh 1 2006 Austin and Travis Schuh, all rights reserved Objectives Basic Building
More information